netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 00/13] "Task_isolation" mode
@ 2020-07-22 14:44 Alex Belits
  2020-07-22 14:47 ` [PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
                   ` (13 more replies)
  0 siblings, 14 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:44 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

This is a new version of task isolation implementation. Previous version is at
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/

Mostly this covers race conditions prevention on breaking isolation. Early after kernel entry,
task_isolation_enter() is called to update flags visible to other CPU cores and to perform
synchronization if necessary. Before this call only "safe" operations happen, as long as
CONFIG_TRACE_IRQFLAGS is not enabled.

This is also intended for future TLB handling -- the idea is to also isolate those CPU cores from
TLB flushes while they are running isolated task in userspace, and do one flush on exiting, before
any code is called that may touch anything updated.

The functionality and interface is unchanged, except for /sys/devices/system/cpu/isolation_running
containing the list of CPUs running isolated tasks. This should be useful for userspace helper
libraries.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
@ 2020-07-22 14:47 ` Alex Belits
  2020-07-22 14:48 ` [PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:47 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index aa961088c551..ded16dfd21fa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -272,6 +272,7 @@ extern void __dec_zone_state(struct zone *, enum
zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum
node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -374,6 +375,7 @@ static inline void __dec_node_page_state(struct
page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3fb23a21f6dd..93534f8537ca 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1889,6 +1889,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on
return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
  2020-07-22 14:47 ` [PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-07-22 14:48 ` Alex Belits
  2020-07-22 14:49 ` [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel Alex Belits
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:48 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From 7823be8cd3ba2e66308f334a2e47f60ba7829e0b Mon Sep 17 00:00:00 2001
From: Chris Metcalf <cmetcalf@mellanox.com>
Date: Sat, 1 Feb 2020 08:05:45 +0000
Subject: [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function

This function checks to see if a vmstat worker is not running,
and the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ded16dfd21fa..97bc9ed92036 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -273,6 +273,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -376,6 +377,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 93534f8537ca..f3693ef0a958 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1898,6 +1898,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
  2020-07-22 14:47 ` [PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
  2020-07-22 14:48 ` [PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-07-22 14:49 ` Alex Belits
  2020-10-01 13:56   ` Frederic Weisbecker
  2020-10-01 14:40   ` Frederic Weisbecker
  2020-07-22 14:51 ` [PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:49 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Isolation also disables CPU state synchronization mechanisms that
are. normally done by IPI. In the future, more synchronization
mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from isolated task --
remote synchronization requests must be re-enabled and
synchronization procedure triggered, before anything other than
low-level kernel entry code is called. Same applies to exiting from
kernel to userspace after isolation is enabled -- either the code
should not depend on synchronization, or isolation should be broken.

For this purpose, per-CPU low-level flags ll_isol_flags are used to
indicate isolation state, and task_isolation_kernel_enter() is used
to safely clear them early in kernel entry. CPU mask corresponding
to isolation bit in ll_isol_flags is visible to userspace as
/sys/devices/system/cpu/isolation_running, and can be used for
monitoring.

Separate patches that follow provide these changes for x86, arm,
and arm64 architectures, xen and irqchip drivers.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 drivers/base/cpu.c                            |  23 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 295 ++++++
 include/linux/sched.h                         |   5 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  28 +
 kernel/Makefile                               |   2 +
 kernel/isolation.c                            | 841 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1266 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..321f9d21a99a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5015,6 +5015,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index d2136ab9b14a..733b0e7d598c 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -287,6 +287,26 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_isolation_running(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	int n;
+	cpumask_var_t isolation_running;
+
+	if (!zalloc_cpumask_var(&isolation_running, GFP_KERNEL))
+		return -ENOMEM;
+
+	task_isolation_cpumask(isolation_running);
+	n = sprintf(buf, "%*pbl\n", cpumask_pr_args(isolation_running));
+
+	free_cpumask_var(isolation_running);
+
+	return n;
+}
+static DEVICE_ATTR(isolation_running, 0444, print_isolation_running, NULL);
+#endif
+
 #ifdef CONFIG_NO_HZ_FULL
 static ssize_t print_cpus_nohz_full(struct device *dev,
 				  struct device_attribute *attr, char *buf)
@@ -473,6 +493,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_isolation_running.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 15c8ac313678..e81252eb4f92 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..a92ff200db12
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,295 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+
+#define BIT_LL_TASK_ISOLATION	(0)
+#define FLAG_LL_TASK_ISOLATION	(1 << BIT_LL_TASK_ISOLATION)
+
+DECLARE_PER_CPU(unsigned long, ll_isol_flags);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_kernel_enter() - clear low-level task isolation flag
+ *
+ * This should be called immediately after entering kernel.
+ */
+static inline void task_isolation_kernel_enter(void)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated.
+	 * When CPU is running isolated task until this point anything
+	 * that causes an interrupt on this CPU must end up calling this
+	 * before touching the rest of kernel. That is, this function or
+	 * fast_task_isolation_cpu_cleanup() or stop_isolation() calling
+	 * it. If any interrupt, including scheduling timer, arrives, it
+	 * will still end up here early after entering kernel.
+	 * From this point interrupts are disabled until all CPUs will see
+	 * that this CPU is no longer running isolated task.
+	 *
+	 * See also fast_task_isolation_cpu_cleanup().
+	 */
+	smp_rmb();
+	if((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0)
+		return;
+
+	local_irq_save(flags);
+
+	/* Clear low-level flags */
+	this_cpu_write(ll_isol_flags, 0);
+
+	/*
+	 * If something happened that requires a barrier that would
+	 * otherwise be called from remote CPUs by CPU kick procedure,
+	 * this barrier runs instead of it. After this barrier, CPU
+	 * kick procedure would see the updated ll_isol_flags, so it
+	 * will run its own IPI to trigger a barrier.
+	 */
+	smp_mb();
+	/*
+	 * Synchronize instructions -- this CPU was not kicked while
+	 * in isolated mode, so it might require synchronization.
+	 * There might be an IPI if kick procedure happened and
+	 * ll_isol_flags was already updated while it assembled a CPU
+	 * mask. However if this did not happen, synchronize everything
+	 * here.
+	 */
+	instr_sync();
+	local_irq_restore(flags);
+}
+
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+static inline int task_isolation_on_cpu(int cpu)
+{
+  return test_bit(BIT_LL_TASK_ISOLATION, &per_cpu(ll_isol_flags, cpu));
+}
+
+extern void task_isolation_check_run_cleanup(void);
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_syscall()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened then just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remotes cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_kernel_enter(void) {}
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline void task_isolation_check_run_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 683372943093..7fb7bb3fddaa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1301,6 +1301,11 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags; /* prctl */
+	unsigned short			task_isolation_state;
+#endif
+
 #ifdef CONFIG_X86_MCE
 	u64				mce_addr;
 	__u64				mce_ripv : 1,
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..f4848ed2a069 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..7cbfe3eae3b6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -620,6 +620,34 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	Allow userspace processes that place themselves on cores with
+	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	to "isolate" themselves from the kernel.  Prior to returning to
+	userspace, isolated tasks will arrange that no future kernel
+	activity will interrupt the task while the task is running in
+	userspace.  Attempting to re-enter the kernel while in this mode
+	will cause the task to be terminated with a signal; you must
+	explicitly use prctl() to disable task isolation before resuming
+	normal use of the kernel.
+
+	This "hard" isolation from the kernel is required for userspace
+	tasks that are running hard real-time tasks in userspace, such as
+	a high-speed network driver in userspace.  Without this option, but
+	with NO_HZ_FULL enabled, the kernel will make a best-faith, "soft"
+	effort to shield a single userspace process from interrupts, but
+	makes no guarantees.
+
+	You should say "N" unless you are intending to run a
+	high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc5ec69..546d8d83cc2f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -130,6 +130,8 @@ KASAN_SANITIZE_stackleak.o := n
 KCSAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_stackleak.o := n
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..afeff490b76c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,841 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * Counter for isolation state on a given CPU, increments when entering
+ * isolation and decrements when exiting isolation (before or after the
+ * cleanup). Multiple simultaneously running procedures entering or
+ * exiting isolation are prevented by checking the result of
+ * incrementing or decrementing this variable. This variable is both
+ * incremented and decremented by CPU that caused isolation entering or
+ * exit.
+ *
+ * This is necessary because multiple isolation-breaking events may happen
+ * at once (or one as the result of the other), however isolation exit
+ * may only happen once to transition from isolated to non-isolated state.
+ * Therefore, if decrementing this counter results in a value less than 0,
+ * isolation exit procedure can't be started -- it already happened, or is
+ * in progress, or isolation is not entered yet.
+ */
+DEFINE_PER_CPU(atomic_t, isol_counter);
+
+/*
+ * Low-level isolation flags.
+ * Those flags are used by low-level isolation set/clear/check routines.
+ * Those flags should be set last before return to userspace and cleared
+ * first upon kernel entry, and synchronized to allow isolation breaking
+ * detection before touching potentially unsynchronized parts of kernel.
+ * Isolated task does not receive synchronization events of any kind, so
+ * at the time of the first entry into kernel it might not be ready to
+ * run most of the kernel code. However to perform synchronization
+ * properly, kernel entry code should also enable synchronization events
+ * at the same time. This presents a problem because more kernel code
+ * should run to determine the cause of isolation breaking, signals may
+ * have to be generated, etc. So some flag clearing and synchronization
+ * should happen in "low-level" entry code but processing of isolation
+ * breaking should happen in "high-level" code. Low-level isolation flags
+ * should be set in that low-level code, possibly long before the cause
+ * of isolation breaking is known. Symmetrically, entering isolation
+ * should disable synchronization events before returning to userspace
+ * but after all potentially volatile code is finished.
+ */
+DEFINE_PER_CPU(unsigned long, ll_isol_flags);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means, counter should be
+ * decremented back, and isolation exit that is already in progress, should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in the message
+			 * originated on remote CPU, and it did not
+			 * re-enter isolated state since then (warned
+			 * is true). Only local messages following
+			 * remote messages, likely about the same
+			 * isolation breaking event, are skipped to
+			 * avoid duplication. If remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, local cause is skipped
+			 * from messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure, index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated.
+	 * When CPU is running isolated task until this point anything
+	 * that causes an interrupt on this CPU must end up calling this
+	 * or task_isolation_kernel_enter() before touching the rest of
+	 * kernel. That is, task_isolation_kernel_enter(), IPI to this
+	 * function or stop_isolation() calling it. If any interrupt,
+	 * including scheduling timer, arrives before a call to this
+	 * function, it will still end up in task_isolation_kernel_enter()
+	 * early after entering kernel.
+	 * From this point interrupts are disabled until all CPUs will see
+	 * that this CPU is no longer running isolated task.
+	 *
+	 * See also task_isolation_kernel_enter().
+	 */
+	local_irq_save(flags);
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	smp_mb__after_atomic();
+	/*
+	 * At this point breaking isolation from other CPUs is possible again,
+	 * however interrupts won't arrive until local_irq_restore()
+	 */
+
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	/* Clear low-level flags if they are not cleared yet */
+	this_cpu_write(ll_isol_flags, 0);
+
+	/*
+	 * If something happened that requires a barrier that would
+	 * otherwise be called from remote CPUs by CPU kick procedure,
+	 * this barrier runs instead of it. After this barrier, CPU
+	 * kick procedure would see the updated ll_isol_flags, so it
+	 * will run its own IPI to trigger a barrier.
+	 */
+	smp_mb();
+	/*
+	 * Synchronize instructions -- this CPU was not kicked while
+	 * in isolated mode, so it might require synchronization.
+	 * There might be an IPI if kick procedure happened and
+	 * ll_isol_flags was already updated while it assembled a CPU
+	 * mask. However if this did not happen, synchronize everything
+	 * here.
+	 */
+	instr_sync();
+	local_irq_restore(flags);
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+		}
+		put_cpu();
+	} else {
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+			atomic_dec(&per_cpu(isol_exit_counter, cpu));
+			put_cpu();
+			return;
+		}
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where
+		 * isolated task is running
+		 * isol_exit_counter will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+	unsigned long flags;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		/* Increment counter, this will allow isolation breaking */
+		if (atomic_inc_return(&per_cpu(isol_counter,
+					      smp_processor_id())) > 1) {
+			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+		}
+		//atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/*
+	 * Task is going to be marked as isolated. This disables IPIs
+	 * used for synchronization, so to avoid inconsistency
+	 * don't let anything interrupt us and issue a barrier at the end.
+	 */
+	local_irq_save(flags);
+
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+	smp_wmb();
+
+	/* From this point this is recognized as isolated by other CPUs */
+	current->task_isolation_state = STATE_ISOLATED;
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION);
+	smp_mb();
+	local_irq_restore(flags);
+	/*
+	 * If anything interrupts us at this point, it will trigger
+	 * isolation breaking procedure.
+	 */
+	return;
+
+error:
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ *
+ * Messages will be suppressed if there is already a reported remote
+ * cause for isolation breaking, so we don't generate multiple
+ * confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/* Are we exiting isolation already? */
+	if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
+		task->task_isolation_state = STATE_NORMAL;
+		return;
+	}
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		task_isolation_kernel_enter();
+		pr_task_isol_err(smp_processor_id(),
+				 "isolation: in NMI; not delivering signal\n");
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (pr_task_isol_warn_supp(smp_processor_id(),
+				   "task_isolation lost due to %s\n", buf))
+		debug_dump_stack();
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void task_isolation_remote(int cpu, const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	va_list args;
+	char buf[200];
+
+	smp_rmb();
+	if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
+		return;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	if (pr_task_isol_warn(cpu,
+			      "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+			      buf,
+			      curr_task->comm, curr_task->tgid,
+			      curr_task->pid, smp_processor_id()))
+		debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void task_isolation_remote_cpumask(const struct cpumask *mask,
+				   const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu, first_cpu;
+
+	if (task_isolation_map == NULL ||
+		!zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	first_cpu = -1;
+	smp_rmb();
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (task_isolation_on_cpu(cpu)) {
+			if (first_cpu < 0)
+				first_cpu = cpu;
+			else
+				cpumask_set_cpu(cpu, warn_mask);
+		}
+	}
+
+	if (first_cpu < 0)
+		goto done;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (cpumask_weight(warn_mask) == 0)
+		pr_task_isol_warn(first_cpu,
+				  "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	else
+		pr_task_isol_warn(first_cpu,
+				  " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  cpumask_pr_args(warn_mask),
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
+
+/*
+ * Set CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
+
+/*
+ * Check if cleanup is scheduled on the current CPU, and if so, run it.
+ * Intended to be called from notify_resume() or another such callback
+ * on the target CPU.
+ */
+void task_isolation_check_run_cleanup(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index ee22ec78fd6d..d01537c32a67 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..6538a744c734 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2527,6 +2528,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index d89da1c7e005..f3d1f3d47242 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e2dc9b8858c..6e4cd8459f05 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -888,6 +888,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (2 preceding siblings ...)
  2020-07-22 14:49 ` [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-07-22 14:51 ` Alex Belits
  2020-07-22 14:51 ` [PATCH v4 05/13] task_isolation: Add xen-specific hook Alex Belits
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:51 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

This commit adds task isolation hooks as follows:

- __handle_domain_irq() and handle_domain_nmi() generate an
  isolation warning for the local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work (through
  __smp_call_single_queue())

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd (through __smp_call_single_queue())

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd (through
  smp_call_function_many_cond())

- on_each_cpu_cond_mask() generates a remote isolation warning for
  the set of remote cpus being IPI'd (through
  smp_call_function_many_cond())

- __ttwu_queue_wakelist() generates a remote isolation warning for
  the remote cpu being IPI'd (through __smp_call_single_queue())

- nmi_enter(), __context_tracking_exit(), __handle_domain_irq(),
  handle_domain_nmi() and scheduler_ipi() clear low-level flags and
  synchronize CPUs by calling task_isolation_kernel_enter()

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can be
placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now, we
just update either callers or callees as makes most sense.

Calls to task_isolation_kernel_enter() are intended for early
kernel entry code. They may be called in platform-independent or
platform-specific code.

It may be possible to clean up low-level entry code and somehow
organize calls to task_isolation_kernel_enter() to avoid multiple
per-architecture or driver-specific calls to it. RCU initialization
may be a good reference point for those places in kernel
(task_isolation_kernel_enter() should precede it), however right now
it is not unified between architectures.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.8, added low-level flags handling]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/hardirq.h   |  2 ++
 include/linux/sched.h     |  2 ++
 kernel/context_tracking.c |  4 ++++
 kernel/irq/irqdesc.c      | 13 +++++++++++++
 kernel/smp.c              |  6 +++++-
 5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 03c9fece7d43..5aab1d0a580e 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <asm/hardirq.h>
 
 extern void synchronize_irq(unsigned int irq);
@@ -114,6 +115,7 @@ extern void rcu_nmi_exit(void);
 #define nmi_enter()						\
 	do {							\
 		arch_nmi_enter();				\
+		task_isolation_kernel_enter();			\
 		printk_nmi_enter();				\
 		lockdep_off();					\
 		BUG_ON(in_nmi() == NMI_MASK);			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7fb7bb3fddaa..cacfa415dc59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include <linux/posix-timers.h>
 #include <linux/rseq.h>
 #include <linux/kcsan.h>
+#include <linux/isolation.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1743,6 +1744,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 #ifdef CONFIG_SMP
 static __always_inline void scheduler_ipi(void)
 {
+	task_isolation_kernel_enter();
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36a98c48aedc..481a722ddbce 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -148,6 +149,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
 	if (!context_tracking_recursion_enter())
 		return;
 
+	task_isolation_kernel_enter();
+
 	if (__this_cpu_read(context_tracking.state) == state) {
 		if (__this_cpu_read(context_tracking.active)) {
 			/*
@@ -159,6 +162,7 @@ void noinstr __context_tracking_exit(enum ctx_state state)
 				instrumentation_begin();
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 				instrumentation_end();
 			}
 		}
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b351aac7732f 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq = hwirq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	irq_enter();
 
 #ifdef CONFIG_IRQ_DOMAIN
@@ -676,6 +679,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 		irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
@@ -710,6 +717,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	/*
 	 * NMI context needs to be setup earlier in order to deal with tracing.
 	 */
@@ -717,6 +726,10 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 
 	irq = irq_find_mapping(domain, hwirq);
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "NMI irq %d (%s)" : "NMI irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * ack_bad_irq is not NMI-safe, just report
 	 * an invalid interrupt.
diff --git a/kernel/smp.c b/kernel/smp.c
index aa17eedff5be..6a6849783948 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 #include "sched/smp.h"
@@ -146,8 +147,10 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(node, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		send_call_function_single_ipi(cpu);
+	}
 }
 
 /*
@@ -545,6 +548,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 05/13] task_isolation: Add xen-specific hook
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (3 preceding siblings ...)
  2020-07-22 14:51 ` [PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-07-22 14:51 ` Alex Belits
  2020-07-22 14:53 ` [PATCH 06/13] task_isolation: Add driver-specific hooks Alex Belits
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:51 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

xen_evtchn_do_upcall() should call task_isolation_kernel_enter()
to indicate that isolation is broken and perform synchronization.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/xen/events/events_base.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 140c7bf33a98..4c16cd58f36b 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/irqnr.h>
 #include <linux/pci.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_X86
 #include <asm/desc.h>
@@ -1236,6 +1237,8 @@ void xen_evtchn_do_upcall(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	task_isolation_kernel_enter();
+
 	irq_enter();
 
 	__xen_evtchn_do_upcall();
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 06/13] task_isolation: Add driver-specific hooks
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (4 preceding siblings ...)
  2020-07-22 14:51 ` [PATCH v4 05/13] task_isolation: Add xen-specific hook Alex Belits
@ 2020-07-22 14:53 ` Alex Belits
  2020-07-22 14:54 ` [PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:53 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

Some drivers don't call functions that call
task_isolation_kernel_enter() in interrupt handlers. Call it
directly.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/irqchip/irq-armada-370-xp.c | 6 ++++++
 drivers/irqchip/irq-gic-v3.c        | 3 +++
 drivers/irqchip/irq-gic.c           | 3 +++
 drivers/s390/cio/cio.c              | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index c9bdc5221b82..df7f2cce3a54 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -29,6 +29,7 @@
 #include <linux/slab.h>
 #include <linux/syscore_ops.h>
 #include <linux/msi.h>
+#include <linux/isolation.h>
 #include <asm/mach/arch.h>
 #include <asm/exception.h>
 #include <asm/smp_plat.h>
@@ -473,6 +474,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
 static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 {
 	u32 msimask, msinr;
+	int isol_entered = 0;
 
 	msimask = readl_relaxed(per_cpu_int_base +
 				ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
@@ -489,6 +491,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 			continue;
 
 		if (is_chained) {
+			if (!isol_entered) {
+				task_isolation_kernel_enter();
+				isol_entered = 1;
+			}
 			irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
 					       msinr - PCI_MSI_DOORBELL_START);
 			generic_handle_irq(irq);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index cc46bc2d634b..be0e0ffa0fb7 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 
 #include <linux/irqchip.h>
 #include <linux/irqchip/arm-gic-common.h>
@@ -629,6 +630,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 {
 	u32 irqnr;
 
+	task_isolation_kernel_enter();
+
 	irqnr = gic_read_iar();
 
 	if (gic_supports_nmi() &&
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index c17fabd6741e..fde547a31566 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -35,6 +35,7 @@
 #include <linux/interrupt.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 #include <linux/irqchip.h>
 #include <linux/irqchip/chained_irq.h>
 #include <linux/irqchip/arm-gic.h>
@@ -353,6 +354,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 	struct gic_chip_data *gic = &gic_data[0];
 	void __iomem *cpu_base = gic_data_cpu_base(gic);
 
+	task_isolation_kernel_enter();
+
 	do {
 		irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
 		irqnr = irqstat & GICC_IAR_INT_ID_MASK;
diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
index 6d716db2a46a..beab88881b6d 100644
--- a/drivers/s390/cio/cio.c
+++ b/drivers/s390/cio/cio.c
@@ -20,6 +20,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 #include <asm/cio.h>
 #include <asm/delay.h>
 #include <asm/irq.h>
@@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
 	struct irb *irb;
 	int irq_context;
 
+	task_isolation_kernel_enter();
+
 	irb = this_cpu_ptr(&cio_irb);
 	/* Store interrupt response block to lowcore. */
 	if (tsch(sch->schid, irb) != 0)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (5 preceding siblings ...)
  2020-07-22 14:53 ` [PATCH 06/13] task_isolation: Add driver-specific hooks Alex Belits
@ 2020-07-22 14:54 ` Alex Belits
  2020-07-22 14:55 ` [PATCH 08/13] task_isolation: arch/arm64: " Alex Belits
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:54 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

In prepare_exit_to_usermode(), run cleanup for tasks exited fromi
isolation and call task_isolation_start() for tasks that entered
TIF_TASK_ISOLATION.

In syscall_trace_enter(), add the necessary support for reporting
syscalls for task-isolation processes.

Add task_isolation_remote() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults.

Add task_isolation_kernel_enter() calls to interrupt and syscall
entry handlers.

This mechanism relies on calls to functions that call
task_isolation_kernel_enter() early after entry into kernel. Those
functions are:

enter_from_user_mode()
  called from do_syscall_64(), do_int80_syscall_32(),
  do_fast_syscall_32(), idtentry_enter_user(),
  idtentry_enter_cond_rcu()
idtentry_enter_cond_rcu()
  called from non-raw IDT macros and other entry points
idtentry_enter_user()
nmi_enter()
xen_call_function_interrupt()
xen_call_function_single_interrupt()
xen_irq_work_interrupt()

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.8]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 20 +++++++++++++++++++-
 arch/x86/include/asm/barrier.h     |  2 ++
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/apic/ipi.c         |  2 ++
 arch/x86/mm/fault.c                |  4 ++++
 arch/x86/xen/smp.c                 |  3 +++
 arch/x86/xen/smp_pv.c              |  2 ++
 8 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..3a80142f85c8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f09288431f28..ab94d90a2bd5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -26,6 +26,7 @@
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_XEN_PV
 #include <xen/xen-ops.h>
@@ -86,6 +87,7 @@ static noinstr void enter_from_user_mode(void)
 {
 	enum ctx_state state = ct_state();
 
+	task_isolation_kernel_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	user_exit_irqoff();
 
@@ -97,6 +99,7 @@ static noinstr void enter_from_user_mode(void)
 #else
 static __always_inline void enter_from_user_mode(void)
 {
+	task_isolation_kernel_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
@@ -161,6 +164,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+	if (task_isolation_syscall(regs->orig_ax) == -1)
+		return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -263,6 +275,8 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs)
 	lockdep_assert_irqs_disabled();
 	lockdep_sys_exit();
 
+	task_isolation_check_run_cleanup();
+
 	cached_flags = READ_ONCE(ti->flags);
 
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
@@ -278,6 +292,9 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
@@ -597,7 +614,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
 		check_user_regs(regs);
 		enter_from_user_mode();
 		return false;
-	}
+	} else
+		task_isolation_kernel_enter();
 
 	/*
 	 * If this entry hit the idle task invoke rcu_irq_enter() whether
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7f828fe49797..5be6ca0519fc 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -4,6 +4,7 @@
 
 #include <asm/alternative.h>
 #include <asm/nops.h>
+#include <asm/processor.h>
 
 /*
  * Force strict CPU ordering.
@@ -53,6 +54,7 @@ static inline unsigned long array_index_mask_nospec(unsigned long index,
 
 #define dma_rmb()	barrier()
 #define dma_wmb()	barrier()
+#define instr_sync()	sync_core()
 
 #ifdef CONFIG_X86_32
 #define __smp_mb()	asm volatile("lock; addl $0,-4(%%esp)" ::: "memory", "cc")
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..6dd1a5cc286d 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASK_ISOLATION	19	/* task isolation enabled for task */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
@@ -136,7 +138,7 @@ struct thread_info {
 /* Work to do before invoking the actual syscall. */
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
-	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
+	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | _TIF_TASK_ISOLATION)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 6ca0f91372fd..b4dfaad6a440 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -2,6 +2,7 @@
 
 #include <linux/cpumask.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 
 #include "local.h"
 
@@ -67,6 +68,7 @@ void native_smp_send_reschedule(int cpu)
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
+	task_isolation_remote(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1ead568c0101..e16a4f5c7e57 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/isolation.h>		/* task_isolation_interrupt     */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1332,6 +1333,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 2097fa0ebdb5..9a3a9bae7d06 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -4,6 +4,7 @@
 #include <linux/slab.h>
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
+#include <linux/isolation.h>
 
 #include <xen/events.h>
 
@@ -265,6 +266,7 @@ void xen_send_IPI_allbutself(int vector)
 
 static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id)
 {
+	task_isolation_kernel_enter();
 	irq_enter();
 	generic_smp_call_function_interrupt();
 	inc_irq_stat(irq_call_count);
@@ -275,6 +277,7 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id)
 
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id)
 {
+	task_isolation_kernel_enter();
 	irq_enter();
 	generic_smp_call_function_single_interrupt();
 	inc_irq_stat(irq_call_count);
diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c
index 171aff1b11f2..d71d3cc36c51 100644
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -24,6 +24,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/stackprotector.h>
 #include <linux/pgtable.h>
+#include <linux/isolation.h>
 
 #include <asm/paravirt.h>
 #include <asm/idtentry.h>
@@ -482,6 +483,7 @@ static void xen_pv_stop_other_cpus(int wait)
 
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id)
 {
+	task_isolation_kernel_enter();
 	irq_enter();
 	irq_work_run();
 	inc_irq_stat(apic_irq_work_irqs);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 08/13] task_isolation: arch/arm64: enable task isolation functionality
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (6 preceding siblings ...)
  2020-07-22 14:54 ` [PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
@ 2020-07-22 14:55 ` Alex Belits
  2020-07-22 14:56 ` [PATCH v4 09/13] task_isolation: arch/arm: " Alex Belits
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:55 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_faults().

Early kernel entry code calls task_isolation_kernel_enter(). In
particular:

Vectors:
el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
el1_irq -> asm_nmi_enter(), handle_arch_irq()
el1_error -> do_serror()
el0_sync -> el0_sync_handler()
el0_irq -> handle_arch_irq()
el0_error -> do_serror()
el0_sync_compat -> el0_sync_compat_handler()
el0_irq_compat -> handle_arch_irq()
el0_error_compat -> do_serror()

SDEI entry:
__sdei_asm_handler -> __sdei_handler() -> nmi_enter()

Functions called from there:
asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()

Handlers:
do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
  or task_isolation_kernel_enter()
el1_sync_handler() -> task_isolation_kernel_enter()
el0_sync_handler() -> task_isolation_kernel_enter()
el0_sync_compat_handler() -> task_isolation_kernel_enter()

handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
  or handle_IPI()
There is a separate patch for irqchips that do not follow this rule.

handle_domain_irq() -> task_isolation_kernel_enter()
handle_IPI() -> task_isolation_kernel_enter()
nmi_enter() -> task_isolation_kernel_enter()

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: simplified to match kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/barrier.h     |  2 ++
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/entry-common.c     |  7 +++++++
 arch/arm64/kernel/ptrace.c           | 16 +++++++++++++++-
 arch/arm64/kernel/sdei.c             |  2 ++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  9 +++++++++
 arch/arm64/mm/fault.c                |  5 +++++
 9 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 66dc41fd49f2..96fefabfa10f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -137,6 +137,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index fb4c27506ef4..bf4a2adabd5b 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -48,6 +48,8 @@
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
+#define instr_sync()	isb()
+
 /*
  * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz
  * and 0 otherwise.
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 5e784e16ee89..73269bb8a57d 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -67,6 +67,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6	/* task isolation enabled for task */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
@@ -86,6 +87,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
@@ -99,7 +101,8 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index d3be9dbf5490..8b682aa020ae 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -8,6 +8,7 @@
 #include <linux/context_tracking.h>
 #include <linux/ptrace.h>
 #include <linux/thread_info.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/daifflags.h>
@@ -70,6 +71,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_DABT_CUR:
 	case ESR_ELx_EC_IABT_CUR:
@@ -231,6 +234,8 @@ asmlinkage void notrace el0_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC64:
 		el0_svc(regs);
@@ -300,6 +305,8 @@ asmlinkage void notrace el0_sync_compat_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC32:
 		el0_svc_compat(regs);
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1e02e98e68dd..5acfc194bdd0 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1851,7 +1852,11 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 int syscall_trace_enter(struct pt_regs *regs)
 {
-	unsigned long flags = READ_ONCE(current_thread_info()->flags);
+	unsigned long flags;
+
+	task_isolation_kernel_enter();
+
+	flags = READ_ONCE(current_thread_info()->flags);
 
 	if (flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
@@ -1859,6 +1864,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return NO_SYSCALL;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return NO_SYSCALL;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return NO_SYSCALL;
diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index dab88260b137..f65b676132f9 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -8,6 +8,7 @@
 #include <linux/irqflags.h>
 #include <linux/sched/task_stack.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/kprobes.h>
@@ -185,6 +186,7 @@ static __kprobes unsigned long _sdei_handler(struct pt_regs *regs,
 	__uaccess_enable_hw_pan();
 
 	err = sdei_event_handler(regs, arg);
+	task_isolation_interrupt("SDEI handled");
 	if (err)
 		return SDEI_EV_FAILED;
 
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 3b4f31f35e45..ece90c5756be 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -907,6 +908,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+	_TIF_UPROBE | _TIF_FSCHECK)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -917,6 +923,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_check_run_cleanup();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -947,7 +955,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index e43a8ff19f0f..c893c8babe76 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -32,6 +32,7 @@
 #include <linux/irq_work.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -835,6 +836,7 @@ void arch_send_call_function_single_ipi(int cpu)
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 #endif
@@ -896,11 +898,16 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 	unsigned int cpu = smp_processor_id();
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	task_isolation_kernel_enter();
+
 	if ((unsigned)ipinr < NR_IPI) {
 		trace_ipi_entry_rcuidle(ipi_types[ipinr]);
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
 	}
 
+	task_isolation_interrupt("IPI type %d (%s)", ipinr,
+				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
 	switch (ipinr) {
 	case IPI_RESCHEDULE:
 		scheduler_ipi();
@@ -963,12 +970,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8afb238ff335..d01f3cbed87f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -539,6 +540,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
+
 		/*
 		 * Major/minor page fault accounting is only done
 		 * once. If we go through a retry, it is extremely
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 09/13] task_isolation: arch/arm: enable task isolation functionality
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (7 preceding siblings ...)
  2020-07-22 14:55 ` [PATCH 08/13] task_isolation: arch/arm64: " Alex Belits
@ 2020-07-22 14:56 ` Alex Belits
  2020-07-22 14:57 ` [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:56 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Francis Giraldeau <francis.giraldeau@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. The task isolation needs an additional thread flag that
requires to change the entry assembly code to accept a bitfield larger than
one byte. The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
defined in the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Early kernel entry relies on task_isolation_kernel_enter().

vector_swi to label __sys_trace
  -> syscall_trace_enter() when task isolation is enabled,
  -> task_isolation_kernel_enter()

nvic_handle_irq()
  -> handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter()

__fiq_svc, __fiq_abt __fiq_usr
  -> handle_fiq_as_nmi() -> uses nmi_enter() / nmi_exit()

__irq_svc -> irq_handler
__irq_usr -> irq_handler
  irq_handler
    -> (handle_arch_irq or
    (arch_irq_handler_default -> (asm_do_IRQ() -> __handle_domain_irq())
      or do_IPI() -> handle_IPI())
  asm_do_IRQ()
    -> __handle_domain_irq() -> task_isolation_kernel_enter()
  do_IPI()
    -> handle_IPI() -> task_isolation_kernel_enter()

handle_arch_irq for arm-specific controllers calls
  (handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter())
    or (handle_domain_irq() -> __handle_domain_irq()
      -> task_isolation_kernel_enter())

Not covered:
__dabt_svc -> dabt_helper
__dabt_usr -> dabt_helper
  dabt_helper -> CPU_DABORT_HANDLER (cpu-specific)
    -> do_DataAbort or PROCESSOR_DABT_FUNC
    -> _data_abort (cpu-specific) -> do_DataAbort

__pabt_svc -> pabt_helper
__pabt_usr -> pabt_helper
  pabt_helper -> CPU_PABORT_HANDLER (cpu-specific)
    -> do_PrefetchAbort or PROCESSOR_PABT_FUNC
    -> _prefetch_abort (cpu-specific) -> do_PrefetchAbort

Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [with modifications]
[abelits@marvell.com: modified for kernel 5.6, added isolation cleanup]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/barrier.h     |  2 ++
 arch/arm/include/asm/thread_info.h | 10 +++++++---
 arch/arm/kernel/entry-common.S     | 15 ++++++++++-----
 arch/arm/kernel/ptrace.c           | 12 ++++++++++++
 arch/arm/kernel/signal.c           | 13 ++++++++++++-
 arch/arm/kernel/smp.c              |  6 ++++++
 arch/arm/mm/fault.c                |  8 +++++++-
 8 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2ac74904a3ce..f06d0e0e4fe9 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 83ae97c049d9..3c603df6c290 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -66,12 +66,14 @@ extern void arm_heavy_mb(void);
 #define wmb()		__arm_heavy_mb(st)
 #define dma_rmb()	dmb(osh)
 #define dma_wmb()	dmb(oshst)
+#define instr_sync()	isb()
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
 #define wmb()		barrier()
 #define dma_rmb()	barrier()
 #define dma_wmb()	barrier()
+#define instr_sync()	barrier()
 #endif
 
 #define __smp_mb()	dmb(ish)
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 3609a6980c34..ec0f11e1bb4c 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation enabled for task */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */
 
 #define TIF_USING_IWMMXT	17
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
@@ -152,18 +153,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 271cb8a1eba1..6ceb5cb808a9 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending
 
 
@@ -90,7 +91,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
  UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)
@@ -98,7 +100,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif
 
-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
@@ -131,7 +134,8 @@ ENTRY(ret_to_user_from_irq)
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]
-	tst	r1, #_TIF_WORK_MASK
+	ldr	r2, =_TIF_WORK_MASK
+	tst	r1, r2
 	bne	slow_work_pending
 no_work_pending:
 	asm_trace_hardirqs_on save = 0
@@ -251,7 +255,8 @@ local_restart:
 	ldr	r10, [tsk, #TI_FLAGS]		@ check for syscall tracing
 	stmdb	sp!, {r4, r5}			@ push fifth and sixth args
 
-	tst	r10, #_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	ldr	r11, =_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	tst	r10, r11
 	bne	__sys_trace
 
 	invoke_syscall tbl, scno, r10, __ret_fast_syscall
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index d0f7c8896c96..7b4fe15ba376 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -24,6 +24,7 @@
 #include <linux/audit.h>
 #include <linux/tracehook.h>
 #include <linux/unistd.h>
+#include <linux/isolation.h>
 
 #include <asm/traps.h>
 
@@ -917,9 +918,20 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 {
 	current_thread_info()->syscall = scno;
 
+	task_isolation_kernel_enter();
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(scno) == -1)
+			return -1;
+	}
+
 	/* Do seccomp after ptrace; syscall may have changed. */
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 	if (secure_computing() == -1)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ab2568996ddb..29ccef8403cd 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -12,6 +12,7 @@
 #include <linux/tracehook.h>
 #include <linux/uprobes.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/elf.h>
 #include <asm/cacheflush.h>
@@ -639,6 +640,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
 	return 0;
 }
 
+#define WORK_PENDING_LOOP_FLAGS	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |	\
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
 asmlinkage int
 do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 {
@@ -648,6 +652,9 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 	 * Update the trace code with the current status.
 	 */
 	trace_hardirqs_off();
+
+	task_isolation_check_run_cleanup();
+
 	do {
 		if (likely(thread_flags & _TIF_NEED_RESCHED)) {
 			schedule();
@@ -676,7 +683,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 		}
 		local_irq_disable();
 		thread_flags = current_thread_info()->flags;
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	return 0;
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 9a6432557871..6a7a4ba02cc4 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -26,6 +26,7 @@
 #include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <linux/atomic.h>
 #include <asm/bugs.h>
@@ -559,6 +560,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 
@@ -578,6 +580,7 @@ void arch_irq_work_raise(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
@@ -633,6 +636,8 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 	unsigned int cpu = smp_processor_id();
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	task_isolation_kernel_enter();
+
 	if ((unsigned)ipinr < NR_IPI) {
 		trace_ipi_entry_rcuidle(ipi_types[ipinr]);
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
@@ -701,6 +706,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c6550eddfce1..23bb383e479e 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/system_misc.h>
 #include <asm/system_info.h>
@@ -330,8 +331,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (8 preceding siblings ...)
  2020-07-22 14:56 ` [PATCH v4 09/13] task_isolation: arch/arm: " Alex Belits
@ 2020-07-22 14:57 ` Alex Belits
  2020-10-01 14:44   ` Frederic Weisbecker
  2020-07-22 14:58 ` [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:57 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds check for it.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6e4cd8459f05..2f82a6daf8fc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	smp_rmb();
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (9 preceding siblings ...)
  2020-07-22 14:57 ` [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-07-22 14:58 ` Alex Belits
  2020-10-01 14:47   ` Frederic Weisbecker
  2020-07-22 14:59 ` [PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:58 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

If CPU runs isolated task, there's no any backlog on it, and
so we don't need to flush it. Currently flush_all_backlogs()
enqueues corresponding work on all CPUs including ones that run
isolated tasks. It leads to breaking task isolation for nothing.

In this patch, backlog flushing is enqueued only on non-isolated CPUs.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_on_cpu() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 net/core/dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..83a282f7453d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -74,6 +74,7 @@
 #include <linux/cpu.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
 
 	get_online_cpus();
 
-	for_each_online_cpu(cpu)
+	smp_rmb();
+	for_each_online_cpu(cpu) {
+		if (task_isolation_on_cpu(cpu))
+			continue;
 		queue_work_on(cpu, system_highpri_wq,
 			      per_cpu_ptr(&flush_works, cpu));
+	}
 
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(&flush_works, cpu));
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (10 preceding siblings ...)
  2020-07-22 14:58 ` [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
@ 2020-07-22 14:59 ` Alex Belits
  2020-07-22 14:59 ` [PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
  2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:59 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race condition between this update and isolation breaking is avoided
at the cost of disabling per_cpu buffer writing for the time of update
when it coincides with isolation breaking.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 63 ++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 00867ff82412..22d4731f0def 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1705,6 +1706,38 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	smp_rmb();
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure, update is done, and isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -1794,13 +1827,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						&cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+					&cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -1849,13 +1891,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
+			if (!update_if_isolated(cpu_buffer, cpu_id))
+				schedule_work_on(cpu_id,
 					 &cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+				wait_for_completion(&cpu_buffer->update_done);
+			}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (11 preceding siblings ...)
  2020-07-22 14:59 ` [PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-07-22 14:59 ` Alex Belits
  2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
  13 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-22 14:59 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 6a6849783948..ff0d95db33b3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -803,9 +803,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
                   ` (12 preceding siblings ...)
  2020-07-22 14:59 ` [PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-07-23 13:17 ` Thomas Gleixner
  2020-07-23 14:26   ` Peter Zijlstra
                     ` (2 more replies)
  13 siblings, 3 replies; 49+ messages in thread
From: Thomas Gleixner @ 2020-07-23 13:17 UTC (permalink / raw)
  To: Alex Belits, frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, will, linux-arm-kernel, linux-kernel, netdev

Alex,

Alex Belits <abelits@marvell.com> writes:
> This is a new version of task isolation implementation. Previous version is at
> https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/
>
> Mostly this covers race conditions prevention on breaking isolation. Early after kernel entry,
> task_isolation_enter() is called to update flags visible to other CPU cores and to perform
> synchronization if necessary. Before this call only "safe" operations happen, as long as
> CONFIG_TRACE_IRQFLAGS is not enabled.

Without going into details of the individual patches, let me give you a
high level view of this series:

  1) Entry code handling:

     That's completely broken vs. the careful ordering and instrumentation
     protection of the entry code. You can't just slap stuff randomly
     into places which you think are safe w/o actually trying to understand
     why this code is ordered in the way it is.

     This clearly was never built and tested with any of the relevant
     debug options enabled. Both build and boot would have told you.

  2) Instruction synchronization

     Trying to do instruction synchronization delayed is a clear recipe
     for hard to diagnose failures. Just because it blew not up in your
     face does not make it correct in any way. It's broken by design and
     violates _all_ rules of safe instruction patching and introduces a
     complete trainwreck in x86 NMI processing.

     If you really think that this is correct, then please have at least
     the courtesy to come up with a detailed and precise argumentation
     why this is a valid approach.

     While writing that up you surely will find out why it is not.

  3) Debug calls

     Sprinkling debug calls around the codebase randomly is not going to
     happen. That's an unmaintainable mess.

     Aside of that none of these dmesg based debug things is necessary.
     This can simply be monitored with tracing.

  4) Tons of undocumented smp barriers

     See Documentation/process/submit-checklist.rst #25

  5) Signal on page fault

     Why is this a magic task isolation feature instead of making it
     something which can be used in general? There are other legit
     reasons why a task might want a notification about an unexpected
     (resolved) page fault.

  6) Coding style violations all over the place

     Using checkpatch.pl is mandatory

  7) Not Cc'ed maintainers

     While your Cc list is huge, you completely fail to Cc the relevant
     maintainers of various files and subsystems as requested in
     Documentation/process/*

  8) Changelogs

     Most of the changelogs have something along the lines:

     'task isolation does not want X, so do Y to make it not do X'

     without any single line of explanation why this approach was chosen
     and why it is correct under all circumstances and cannot have nasty
     side effects.

     It's not the job of the reviewers/maintainers to figure this out.

Please come up with a coherent design first and then address the
identified issues one by one in a way which is palatable and reviewable.

Throwing a big pile of completely undocumented 'works for me' mess over
the fence does not get you anywhere, not even to the point that people
are willing to review it in detail.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
@ 2020-07-23 14:26   ` Peter Zijlstra
  2020-07-23 14:53     ` Thomas Gleixner
  2020-07-23 14:29   ` Peter Zijlstra
  2020-07-23 15:18   ` Alex Belits
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2020-07-23 14:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alex Belits, frederic, rostedt, Prasun Kapoor, mingo, davem,
	linux-api, linux-arch, catalin.marinas, will, linux-arm-kernel,
	linux-kernel, netdev

On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:

>   2) Instruction synchronization
> 
>      Trying to do instruction synchronization delayed is a clear recipe
>      for hard to diagnose failures. Just because it blew not up in your
>      face does not make it correct in any way. It's broken by design and
>      violates _all_ rules of safe instruction patching and introduces a
>      complete trainwreck in x86 NMI processing.
> 
>      If you really think that this is correct, then please have at least
>      the courtesy to come up with a detailed and precise argumentation
>      why this is a valid approach.
> 
>      While writing that up you surely will find out why it is not.

So delaying the sync_core() IPIs for kernel text patching _might_ be
possible, but it very much wants to be a separate patchset and not
something hidden inside a 'gem' like this.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
  2020-07-23 14:26   ` Peter Zijlstra
@ 2020-07-23 14:29   ` Peter Zijlstra
  2020-07-23 15:41     ` [EXT] " Alex Belits
  2020-07-23 15:18   ` Alex Belits
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2020-07-23 14:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alex Belits, frederic, rostedt, Prasun Kapoor, mingo, davem,
	linux-api, linux-arch, catalin.marinas, will, linux-arm-kernel,
	linux-kernel, netdev

On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:
>   8) Changelogs
> 
>      Most of the changelogs have something along the lines:
> 
>      'task isolation does not want X, so do Y to make it not do X'
> 
>      without any single line of explanation why this approach was chosen
>      and why it is correct under all circumstances and cannot have nasty
>      side effects.
> 
>      It's not the job of the reviewers/maintainers to figure this out.
> 
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
> 
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.

This.. as presented it is an absolutely unreviewable pile of junk. It
presents code witout any coherent problem description and analysis. And
the patches are not split sanely either.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 14:26   ` Peter Zijlstra
@ 2020-07-23 14:53     ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2020-07-23 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Belits, frederic, rostedt, Prasun Kapoor, mingo, davem,
	linux-api, linux-arch, catalin.marinas, will, linux-arm-kernel,
	linux-kernel, netdev

Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Jul 23, 2020 at 03:17:04PM +0200, Thomas Gleixner wrote:
>
>>   2) Instruction synchronization
>> 
>>      Trying to do instruction synchronization delayed is a clear recipe
>>      for hard to diagnose failures. Just because it blew not up in your
>>      face does not make it correct in any way. It's broken by design and
>>      violates _all_ rules of safe instruction patching and introduces a
>>      complete trainwreck in x86 NMI processing.
>> 
>>      If you really think that this is correct, then please have at least
>>      the courtesy to come up with a detailed and precise argumentation
>>      why this is a valid approach.
>> 
>>      While writing that up you surely will find out why it is not.
>
> So delaying the sync_core() IPIs for kernel text patching _might_ be
> possible, but it very much wants to be a separate patchset and not
> something hidden inside a 'gem' like this.

I'm not saying it's impossible, but the proposed hack is definitely
beyond broken and you really don't want to be the one who has to mop up
the pieces later.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
  2020-07-23 14:26   ` Peter Zijlstra
  2020-07-23 14:29   ` Peter Zijlstra
@ 2020-07-23 15:18   ` Alex Belits
  2020-07-23 15:49     ` Peter Zijlstra
  2020-07-23 21:31     ` Thomas Gleixner
  2 siblings, 2 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-23 15:18 UTC (permalink / raw)
  To: tglx, frederic, rostedt
  Cc: mingo, Prasun Kapoor, linux-api, peterz, linux-arch,
	catalin.marinas, will, linux-arm-kernel, davem, linux-kernel,
	netdev

On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> 
> Without going into details of the individual patches, let me give you a
> high level view of this series:
> 
>   1) Entry code handling:
> 
>      That's completely broken vs. the careful ordering and instrumentation
>      protection of the entry code. You can't just slap stuff randomly
>      into places which you think are safe w/o actually trying to understand
>      why this code is ordered in the way it is.
> 
>      This clearly was never built and tested with any of the relevant
>      debug options enabled. Both build and boot would have told you.

This is intended to avoid a race condition when entry or exit from isolation
happens at the same time as an event that requires synchronization. The idea
is, it is possible to insulate the core from all events while it is running
isolated task in userspace, it will receive those calls normally after
breaking isolation and entering kernel, and it will synchronize itself on
kernel entry.

This has two potential problems that I am trying to solve:

1. Without careful ordering, there will be a race condition with events that
happen at the same time as kernel entry or exit.

2. CPU runs some kernel code after entering but before synchronization. This
code should be restricted to early entry that is not affected by the "stale"
state, similar to how IPI code that receives synchronization events does it
normally.

I can't say that I am completely happy with the amount of kernel entry
handling that had to be added. The problem is, I am trying to introduce a
feature that allows CPU cores to go into "de-synchronized" state while running
isolated tasks and not receiving synchronization events that normally would
reach them. This means, there should be established some point on kernel entry
when it is safe for the core to catch up with the rest of kernel. It may be
useful for other purposes, however at this point task isolation is the first
to need it, so I had to determine where such point is for every supported
architecture and method of kernel entry.

I have found that each architecture has its own way of handling this,
and sometimes individual interrupt controller drivers vary in their
sequence of calls on early kernel entry. For x86 I also have an
implementation for kernel 5.6, before your changes to IDT macros.
That version is much less straightforward, so I am grateful for those
relatively recent improvements.

Nevertheless, I believe that the goal of finding those points and using
them for synchronization is valid. If you can recommend me a better way
for at least x86, I will be happy to follow your advice. I have tried to
cover kernel entry in a generic way while making the changes least
disruptive, and this is why it looks simple and spread over multiple
places. I also had to do the same for arm and arm64 (that I use for
development), and for each architecture I had to produce sequences of
entry points and function calls to determine the correct placement of
task_isolation_enter() calls in them. It is not random, however it does
reflect the complex nature of kernel entry code. I believe, RCU
implementation faced somewhat similar requirements for calls on kernel
entry, however it is not completely unified, either

>  2) Instruction synchronization
>     Trying to do instruction synchronization delayed is a clear recipe
>     for hard to diagnose failures. Just because it blew not up in your
>     face does not make it correct in any way. It's broken by design and
>     violates _all_ rules of safe instruction patching and introduces a
>     complete trainwreck in x86 NMI processing.

The idea is that just like synchronization events are handled by regular IPI,
we already use some code with the assumption that it is safe to be entered in
"stale" state before synchronization. I have extended it to allow
synchronization points on all kernel entry points.

>     If you really think that this is correct, then please have at least
>     the courtesy to come up with a detailed and precise argumentation
>     why this is a valid approach.
>
>     While writing that up you surely will find out why it is not.
> 

I had to document a sequence of calls for every entry point on three supported
architectures, to determine the points for synchronization. It is possible that
I have somehow missed something, however I don't see a better approach, save
for establishing a kernel-wide infrastructure for this. And even if we did just
that, it would be possible to implement this kind of synchronization point
calls first, and convert them to something more generic later.

> 
>   3) Debug calls
> 
>      Sprinkling debug calls around the codebase randomly is not going to
>      happen. That's an unmaintainable mess.

Those report isolation breaking causes, and are intended for application and
system debugging.

> 
>      Aside of that none of these dmesg based debug things is necessary.
>      This can simply be monitored with tracing.

I think, it would be better to make all that information available to the
userspace application, however I have based this on the Chris Metcalf code,
and gradually updated the mechanisms and interfaces. The original reporting
of isolation breaking causes had far greater problems, so at first I wanted
to have something that produces easily visible and correct reporting, and
does not break things while doing so.

> 
>   4) Tons of undocumented smp barriers
> 
>      See Documentation/process/submit-checklist.rst #25
> 

That should be fixed.

>   5) Signal on page fault
> 
>      Why is this a magic task isolation feature instead of making it
>      something which can be used in general? There are other legit
>      reasons why a task might want a notification about an unexpected
>      (resolved) page fault.

Page fault causes isolation breaking. When a task runs in isolated mode it
does so because it requires predictable timing, so causing page faults and
expecting them to be handled along the way would defeat the purpose of
isolation. So if page fault did happen, it is important that application will
receive notification about isolation being broken, and then may decide to do
something about it, re-enter isolation, etc.

> 
>   6) Coding style violations all over the place
> 
>      Using checkpatch.pl is mandatory
> 
>   7) Not Cc'ed maintainers
> 
>      While your Cc list is huge, you completely fail to Cc the relevant
>      maintainers of various files and subsystems as requested in
>      Documentation/process/*

To be honest, I am not sure, whom I have missed, I tried to include everyone
from my previous attempt.

> 
>   8) Changelogs
> 
>      Most of the changelogs have something along the lines:
> 
>      'task isolation does not want X, so do Y to make it not do X'
> 
>      without any single line of explanation why this approach was chosen
>      and why it is correct under all circumstances and cannot have nasty
>      side effects.

This is the same as the previous version, except for the addition of kernel
entry handling. As far as I can tell, the rest was discussed before, and not
many questions remained except for the race condition on kernel entry. I
agree that kernel entry handling is a complex issue in itself, so I have
included explanation of entry points / function calls sequences for each
supported architecture. I have longer call diagram, that I used to track
each particular function, it probably should be included as a separate
document.

>      It's not the job of the reviewers/maintainers to figure this out.
> 
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
> 
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.

There is a design, and it is a result of a careful tracking of calls in the
kernel source. It has multiple point where task_isolation_enter() is called
for a reason similar to why RCU-related functions are called in multiple
places.

If someone can recommend a better way to introduce a kernel entry
checkpoint for synchronization that did not exist before, I will be happy
to hear it.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 14:29   ` Peter Zijlstra
@ 2020-07-23 15:41     ` Alex Belits
  2020-07-23 15:48       ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-07-23 15:41 UTC (permalink / raw)
  To: tglx, peterz
  Cc: mingo, Prasun Kapoor, linux-api, rostedt, linux-arch,
	catalin.marinas, will, frederic, davem, linux-kernel, netdev,
	linux-arm-kernel


On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> .
> 
> This.. as presented it is an absolutely unreviewable pile of junk. It
> presents code witout any coherent problem description and analysis.
> And
> the patches are not split sanely either.

There is a more complete and slightly outdated description in the
previous version of the patch at 
https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/
 .

It allows userspace application to take a CPU core for itself and run
completely isolated, with no disturbances. There is work in progress
that also disables and re-enables TLB flushes, and depending on CPU it
may be possible to also pre-allocate cache, so it would not be affected
by the rest of the system. Events that cause interaction with isolated
task, cause isolation breaking, turning the task into a regular
userspace task that can continue running normally and enter isolated
state again if necessary.

To make this feature suitable for any practical use, many mechanisms
that normally would cause events on a CPU, should exclude CPU cores in
this state, and synchronization should happen later, at the time of
isolation breaking.

There are three architectures supported, x86, arm and arm64, and it
should be possible to extend it to others. Unfortunately kernel entry
procedures are neither unified, nor straightforward, so introducing new
feature to them causes an appearance of a mess.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 15:41     ` [EXT] " Alex Belits
@ 2020-07-23 15:48       ` Peter Zijlstra
  2020-07-23 16:19         ` Alex Belits
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2020-07-23 15:48 UTC (permalink / raw)
  To: Alex Belits
  Cc: tglx, mingo, Prasun Kapoor, linux-api, rostedt, linux-arch,
	catalin.marinas, will, frederic, davem, linux-kernel, netdev,
	linux-arm-kernel

On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote:
> 
> On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > .
> > 
> > This.. as presented it is an absolutely unreviewable pile of junk. It
> > presents code witout any coherent problem description and analysis.
> > And
> > the patches are not split sanely either.
> 
> There is a more complete and slightly outdated description in the
> previous version of the patch at 
> https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/

Not the point, you're mixing far too many things in one go. You also
have the patches split like 'generic / arch-1 / arch-2' which is wrong
per definition, as patches should be split per change and not care about
sily boundaries.

Also, if you want generic entry code, there's patches for that here:

  https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 15:18   ` Alex Belits
@ 2020-07-23 15:49     ` Peter Zijlstra
  2020-07-23 16:50       ` Alex Belits
  2020-07-23 21:31     ` Thomas Gleixner
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2020-07-23 15:49 UTC (permalink / raw)
  To: Alex Belits
  Cc: tglx, frederic, rostedt, mingo, Prasun Kapoor, linux-api,
	linux-arch, catalin.marinas, will, linux-arm-kernel, davem,
	linux-kernel, netdev

On Thu, Jul 23, 2020 at 03:18:42PM +0000, Alex Belits wrote:
> On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
> > 
> > Without going into details of the individual patches, let me give you a
> > high level view of this series:
> > 
> >   1) Entry code handling:
> > 
> >      That's completely broken vs. the careful ordering and instrumentation
> >      protection of the entry code. You can't just slap stuff randomly
> >      into places which you think are safe w/o actually trying to understand
> >      why this code is ordered in the way it is.
> > 
> >      This clearly was never built and tested with any of the relevant
> >      debug options enabled. Both build and boot would have told you.
> 
> This is intended to avoid a race condition when entry or exit from isolation
> happens at the same time as an event that requires synchronization. The idea
> is, it is possible to insulate the core from all events while it is running
> isolated task in userspace, it will receive those calls normally after
> breaking isolation and entering kernel, and it will synchronize itself on
> kernel entry.

'What does noinstr mean? and why do we have it" -- don't dare touch the
entry code until you can answer that.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 15:48       ` Peter Zijlstra
@ 2020-07-23 16:19         ` Alex Belits
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-07-23 16:19 UTC (permalink / raw)
  To: tglx, peterz
  Cc: mingo, Prasun Kapoor, linux-api, rostedt, frederic,
	catalin.marinas, will, linux-arch, davem, linux-kernel,
	linux-arm-kernel, netdev


On Thu, 2020-07-23 at 17:48 +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote:
> > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote:
> > > .
> > > 
> > > This.. as presented it is an absolutely unreviewable pile of
> > > junk. It
> > > presents code witout any coherent problem description and
> > > analysis.
> > > And
> > > the patches are not split sanely either.
> > 
> > There is a more complete and slightly outdated description in the
> > previous version of the patch at 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_lkml_07c25c246c55012981ec0296eee23e68c719333a.camel-40marvell.com_&d=DwIBAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=1qgvOnXfk3ZHJA3p7RIb6NFqs4SPPDyPI_PcwNFp8KY&m=shk9a5FDwktOZysSbFIjxmgUg-IPyw2UkbVAHGBhNV0&s=FFZaj-KanwqEiXYCdjd96JOgP_GAOnanpkw6bBvNrK4&e= 
> 
> Not the point, you're mixing far too many things in one go. You also
> have the patches split like 'generic / arch-1 / arch-2' which is
> wrong
> per definition, as patches should be split per change and not care
> about
> sily boundaries.

This follows the original patch by Chris Metcalf. There is a reason for
that -- per-architecture changes are independent from each other and
affect not just code but functionality that was implemented per-
architecture. To support more architectures, it will be necessary to do
it separately for each, and mark them supported with
HAVE_ARCH_TASK_ISOLATION. Having only some architectures supported does
not break anything for the rest -- architectures that are not covered,
would not have this functionality.

> 
> Also, if you want generic entry code, there's patches for that here:
> 
>   
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.kernel.org_r_20200722215954.464281930-40linutronix.de&d=DwIBAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=1qgvOnXfk3ZHJA3p7RIb6NFqs4SPPDyPI_PcwNFp8KY&m=shk9a5FDwktOZysSbFIjxmgUg-IPyw2UkbVAHGBhNV0&s=nZXIviY7rva31KvPgSVnTacwFNbsmkdW0LxSTfYSiqg&e=
>  
> 
> 

That looks useful. Why didn't Thomas Gleixner mention it in his
criticism of my approach if he already solved that exact problem, at
least for x86?

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 15:49     ` Peter Zijlstra
@ 2020-07-23 16:50       ` Alex Belits
  2020-07-23 21:44         ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-07-23 16:50 UTC (permalink / raw)
  To: peterz
  Cc: davem, Prasun Kapoor, mingo, linux-api, rostedt, frederic, tglx,
	catalin.marinas, linux-arch, will, linux-kernel, netdev,
	linux-arm-kernel


On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> 
> 'What does noinstr mean? and why do we have it" -- don't dare touch
> the
> entry code until you can answer that.

noinstr disables instrumentation, so there would not be calls and
dependencies on other parts of the kernel when it's not yet safe to
call them. Relevant functions already have it, and I add an inline call
to perform flags update and synchronization. Unless something else is
involved, those operations are safe, so I am not adding anything that
can break those.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 15:18   ` Alex Belits
  2020-07-23 15:49     ` Peter Zijlstra
@ 2020-07-23 21:31     ` Thomas Gleixner
  1 sibling, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2020-07-23 21:31 UTC (permalink / raw)
  To: Alex Belits, frederic, rostedt
  Cc: mingo, Prasun Kapoor, linux-api, peterz, linux-arch,
	catalin.marinas, will, linux-arm-kernel, davem, linux-kernel,
	netdev

Alex,

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
>> 
>> Without going into details of the individual patches, let me give you a
>> high level view of this series:
>> 
>>   1) Entry code handling:
>> 
>>      That's completely broken vs. the careful ordering and instrumentation
>>      protection of the entry code. You can't just slap stuff randomly
>>      into places which you think are safe w/o actually trying to understand
>>      why this code is ordered in the way it is.
>> 
>>      This clearly was never built and tested with any of the relevant
>>      debug options enabled. Both build and boot would have told you.
>
> This is intended to avoid a race condition when entry or exit from isolation
> happens at the same time as an event that requires synchronization. The idea
> is, it is possible to insulate the core from all events while it is running
> isolated task in userspace, it will receive those calls normally after
> breaking isolation and entering kernel, and it will synchronize itself on
> kernel entry.

It does not matter what your intention is. Fact is that you disrupt a
carefully designed entry code sequence without even trying to point out
that you did so because you don't know how to do it better. There is a
big fat comment above enter_from_user_mode() which should have make you
ask at least. Peter and myself spent month on getting this correct
vs. RCU, instrumentation, code patching and some more things.

From someone who tries to fiddle with such a sensitive area of code it's
not too much asked that he follows or reads up on these changes instead
of just making uninformed choices of placement by defining that this new
stuff is the most important thing on the planet or at least documenting
why this is correct and not violating any of the existing constraints.

> This has two potential problems that I am trying to solve:
>
> 1. Without careful ordering, there will be a race condition with events that
> happen at the same time as kernel entry or exit.

Entry code is all about ordering. News at 11.

> 2. CPU runs some kernel code after entering but before synchronization. This
> code should be restricted to early entry that is not affected by the "stale"
> state, similar to how IPI code that receives synchronization events does it
> normally.

And because of that you define that you can place anything you need
_before_ functionality which is essential for establishing kernel state
correctly without providing the minimum proof that this does not violate
any of the existing contraints.

> reach them. This means, there should be established some point on kernel entry
> when it is safe for the core to catch up with the rest of kernel. It may be
> useful for other purposes, however at this point task isolation is the first
> to need it, so I had to determine where such point is for every supported
> architecture and method of kernel entry.

You decided that your feature has to run first. Where is the analysis
that this is safe and correct vs. the existing ordering constraints?

Why does this trigger build and run time warnings? (I neither built nor
ran it, but with full debug enabled it will for sure).

> Nevertheless, I believe that the goal of finding those points and using
> them for synchronization is valid.

The goal does not justify the means. 

> If you can recommend me a better way for at least x86, I will be happy
> to follow your advice. I have tried to cover kernel entry in a generic
> way while making the changes least disruptive, and this is why it
> looks simple and spread over multiple places.

It does not look simple. It looks random and like the outcome of try and
error. Oh, here it explodes, lets slap another instance into it.

> I also had to do the same for arm and arm64 (that I use for
> development), and for each architecture I had to produce sequences of
> entry points and function calls to determine the correct placement of
> task_isolation_enter() calls in them. It is not random, however it does
> reflect the complex nature of kernel entry code. I believe, RCU
> implementation faced somewhat similar requirements for calls on kernel
> entry, however it is not completely unified, either

But RCU has a well defined design and requirement list and people are
working on making the entry sequence generic and convert architectures
over to it. And no, we don't try to do 5 architectures at once. We did
x86 with an eye on others. It's not perfect and it never will be because
of hardware.

>>  2) Instruction synchronization
>>     Trying to do instruction synchronization delayed is a clear recipe
>>     for hard to diagnose failures. Just because it blew not up in your
>>     face does not make it correct in any way. It's broken by design and
>>     violates _all_ rules of safe instruction patching and introduces a
>>     complete trainwreck in x86 NMI processing.
>
> The idea is that just like synchronization events are handled by regular IPI,
> we already use some code with the assumption that it is safe to be entered in
> "stale" state before synchronization. I have extended it to allow
> synchronization points on all kernel entry points.

The idea is clear, just where is the analysis that this is safe?

Just from quickly skimming the code it's clear that this has never been
done. Experimental development on the base of 'does not explode' is not
a valid approach in the kernel whatever your goal is.

>>     If you really think that this is correct, then please have at least
>>     the courtesy to come up with a detailed and precise argumentation
>>     why this is a valid approach.
>>
>>     While writing that up you surely will find out why it is not.
>
> I had to document a sequence of calls for every entry point on three supported
> architectures, to determine the points for synchronization.

Why is that documentation not part of the patches in form of
documentation or proper changelogs?

> It is possible that I have somehow missed something, however I don't
> see a better approach, save for establishing a kernel-wide
> infrastructure for this. And even if we did just that, it would be
> possible to implement this kind of synchronization point calls first,
> and convert them to something more generic later.

You're putting the cart before the horse.

You want delayed instruction patching synchronization. So the right
approach is to:

   1) Analyze the constraints of instruction patching on a given
      architecture.

   2) Implement a scheme for this architecture to handle delayed
      patching as a stand alone feature with well documented and fine
      grained patches and proper prove that none of the constraints is
      violated.

      Find good arguments why such a feature is generally useful and not
      only for your personal pet pieve.

Once you've done that, then you'll find out that there is no need for
magic task isolation hackery simply because it's already there.

Code patching is very much architecture specific and the constraints
vary due to the different hardware requirements. The idea of making this
generic is laudable, but naive at best. Once you have done #1 above on
two architectures you will know why.

>>   3) Debug calls
>> 
>>      Sprinkling debug calls around the codebase randomly is not going to
>>      happen. That's an unmaintainable mess.
>
> Those report isolation breaking causes, and are intended for application and
> system debugging.

I don't care what they do as that does not make them more palatable or
maintainable.

>> 
>>      Aside of that none of these dmesg based debug things is necessary.
>>      This can simply be monitored with tracing.
>
> I think, it would be better to make all that information available to the
> userspace application, however I have based this on the Chris Metcalf code,
> and gradually updated the mechanisms and interfaces. The original reporting
> of isolation breaking causes had far greater problems, so at first I wanted
> to have something that produces easily visible and correct reporting, and
> does not break things while doing so.

Why are you exposing other people to these horrors? I don't care what
you use in your development branch and I don't care what you share with
your friends, but if you want maintainers and reviewers to look at that
stuff then ensure that what you present:

  - Makes sense
  - Is properly implemented
  - Is properly documented
  - Is properly argumented why this is the right approach.

'I need', 'I want', 'this does' are non-arguments to begin with.

>>   5) Signal on page fault
>> 
>>      Why is this a magic task isolation feature instead of making it
>>      something which can be used in general? There are other legit
>>      reasons why a task might want a notification about an unexpected
>>      (resolved) page fault.
>
> Page fault causes isolation breaking. When a task runs in isolated mode it
> does so because it requires predictable timing, so causing page faults and
> expecting them to be handled along the way would defeat the purpose of
> isolation. So if page fault did happen, it is important that application will
> receive notification about isolation being broken, and then may decide to do
> something about it, re-enter isolation, etc.

Did you actually read what I wrote? I very much understood what you are
trying to do and why. Otherwise I wouldn't have written the above.

>>   6) Coding style violations all over the place
>> 
>>      Using checkpatch.pl is mandatory
>> 
>>   7) Not Cc'ed maintainers
>> 
>>      While your Cc list is huge, you completely fail to Cc the relevant
>>      maintainers of various files and subsystems as requested in
>>      Documentation/process/*
>
> To be honest, I am not sure, whom I have missed, I tried to include everyone
> from my previous attempt.

May I ask you to read, understand and follow the documentation I pointed
you to?

>>   8) Changelogs
>> 
>>      Most of the changelogs have something along the lines:
>> 
>>      'task isolation does not want X, so do Y to make it not do X'
>> 
>>      without any single line of explanation why this approach was chosen
>>      and why it is correct under all circumstances and cannot have nasty
>>      side effects.
>
> This is the same as the previous version, except for the addition of kernel
> entry handling. As far as I can tell, the rest was discussed before, and not
> many questions remained except for the race condition on kernel entry.

How is that related to changelogs which are useless? 

> agree that kernel entry handling is a complex issue in itself, so I have
> included explanation of entry points / function calls sequences for each
> supported architecture.

Which explanations? Let's talk about 7/13 the x86 part:

> In prepare_exit_to_usermode(), run cleanup for tasks exited fromi
> isolation and call task_isolation_start() for tasks that entered
> TIF_TASK_ISOLATION.
>
> In syscall_trace_enter(), add the necessary support for reporting
> syscalls for task-isolation processes.
>
> Add task_isolation_remote() calls for the kernel exception types
> that do not result in signals, namely non-signalling page faults.
>
> Add task_isolation_kernel_enter() calls to interrupt and syscall
> entry handlers.
>
> This mechanism relies on calls to functions that call
> task_isolation_kernel_enter() early after entry into kernel. Those
> functions are:
>
> enter_from_user_mode()
>  called from do_syscall_64(), do_int80_syscall_32(),
>  do_fast_syscall_32(), idtentry_enter_user(),
>  idtentry_enter_cond_rcu()
> idtentry_enter_cond_rcu()
>  called from non-raw IDT macros and other entry points
> idtentry_enter_user()
> nmi_enter()
> xen_call_function_interrupt()
> xen_call_function_single_interrupt()
> xen_irq_work_interrupt()

Can you point me to a single word of explanation in this blurb?

It's a list of things WHAT the patch does without a single word of WHY
and without a single word of WHY any of this would be correct.

> I have longer call diagram, that I used to track each particular
> function, it probably should be included as a separate document.

Call diagrams are completely useless. The people who have to review this
know how that works. They want real explanations:

     - Why is this the right approach
     - Why does this not violate constraints A, B, C
     - What are the potential side effects
     - ...

All of this is asked for in Documentation/process/* for a reason.

>>      It's not the job of the reviewers/maintainers to figure this out.
>> 
>> Please come up with a coherent design first and then address the
>> identified issues one by one in a way which is palatable and reviewable.
>> 
>> Throwing a big pile of completely undocumented 'works for me' mess over
>> the fence does not get you anywhere, not even to the point that people
>> are willing to review it in detail.
>
> There is a design, and it is a result of a careful tracking of calls in the
> kernel source. It has multiple point where task_isolation_enter() is called
> for a reason similar to why RCU-related functions are called in multiple
> places.

Design based on call tracking? That must be some newfangled method of
design which was not taught when I was in school.

You can do analysis with call tracking, but not design. 

Comparing this to RCU is beyond hillarious. RCU has design and
requirements documented and every single instance of RCU state
establishment has been argued in the changelogs and is most of the time
(except for the obvious places) extensively commented.

> If someone can recommend a better way to introduce a kernel entry
> checkpoint for synchronization that did not exist before, I will be happy
> to hear it.

Start with a coherent explanation of:

  - What you are trying to achieve

  - Which problems did you observe in your analysis including the
    impact of the problem on your goal.

  - A per problem conceptual approach to solve it along with cleanly
    implemented and independent RFC code for each particular problem
    without tons of debug hacks and the vain attempts to make everything
    generic. There might be common parts of it, but as explained with
    code patching and #PF signals they can be completely independent of
    each other.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 16:50       ` Alex Belits
@ 2020-07-23 21:44         ` Thomas Gleixner
  2020-07-24  3:00           ` [EXT] " Alex Belits
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2020-07-23 21:44 UTC (permalink / raw)
  To: Alex Belits, peterz
  Cc: davem, Prasun Kapoor, mingo, linux-api, rostedt, frederic,
	catalin.marinas, linux-arch, will, linux-kernel, netdev,
	linux-arm-kernel

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
>> 
>> 'What does noinstr mean? and why do we have it" -- don't dare touch
>> the
>> entry code until you can answer that.
>
> noinstr disables instrumentation, so there would not be calls and
> dependencies on other parts of the kernel when it's not yet safe to
> call them. Relevant functions already have it, and I add an inline call
> to perform flags update and synchronization. Unless something else is
> involved, those operations are safe, so I am not adding anything that
> can break those.

Sure.

 1) That inline function can be put out of line by the compiler and
    placed into the regular text section which makes it subject to
    instrumentation

 2) That inline function invokes local_irq_save() which is subject to
    instrumentation _before_ the entry state for the instrumentation
    mechanisms is established.

 3) That inline function invokes sync_core() before important state has
    been established, which is especially interesting in NMI like
    exceptions.

As you clearly documented why all of the above is safe and does not
cause any problems, it's just me and Peter being silly, right?

Try again.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-23 21:44         ` Thomas Gleixner
@ 2020-07-24  3:00           ` Alex Belits
  2020-07-24 16:08             ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-07-24  3:00 UTC (permalink / raw)
  To: tglx, peterz
  Cc: mingo, Prasun Kapoor, linux-api, rostedt, frederic,
	catalin.marinas, will, linux-arch, davem, linux-kernel,
	linux-arm-kernel, netdev


On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> Alex Belits <abelits@marvell.com> writes:
> > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> > > 'What does noinstr mean? and why do we have it" -- don't dare
> > > touch
> > > the
> > > entry code until you can answer that.
> > 
> > noinstr disables instrumentation, so there would not be calls and
> > dependencies on other parts of the kernel when it's not yet safe to
> > call them. Relevant functions already have it, and I add an inline
> > call
> > to perform flags update and synchronization. Unless something else
> > is
> > involved, those operations are safe, so I am not adding anything
> > that
> > can break those.
> 
> Sure.
> 
>  1) That inline function can be put out of line by the compiler and
>     placed into the regular text section which makes it subject to
>     instrumentation
> 
>  2) That inline function invokes local_irq_save() which is subject to
>     instrumentation _before_ the entry state for the instrumentation
>     mechanisms is established.
> 
>  3) That inline function invokes sync_core() before important state
> has
>     been established, which is especially interesting in NMI like
>     exceptions.
> 
> As you clearly documented why all of the above is safe and does not
> cause any problems, it's just me and Peter being silly, right?
> 
> Try again.

I don't think, accusations and mockery are really necessary here.

I am trying to do the right thing here. In particular, I am trying to
port the code that was developed on platforms that have not yet
implemented those useful instrumentation safety features of x86 arch
support. For most of the development time I had to figure out, where
the synchronization can be safely inserted into kernel entry code on
three platforms and tens of interrupt controller drivers, with some of
those presenting unusual exceptions (forgive me the pun) from platform-
wide conventions. I really appreciate the work you did cleaning up
kernel entry procedures, my 5.6 version of this patch had to follow a
much more complex and I would say, convoluted entry handling on x86,
and now I don't have to do that, thanks to you.

Unfortunately, most of my mental effort recently had to be spent on
three things:

1. (small): finding a way to safely enable events and synchronize state
on kernel entry, so it will not have a race condition between
isolation-breaking kernel entry and an event that was disabled while
the task was isolated.

2. (big): trying to derive any useful rules applicable to kernel entry
in various architectures, finding that there is very little consistency
across architectures, and whatever exists, can be broken by interrupt
controller drivers that don't all follow the same rules as the rest of
the platform.

3. (medium): introducing calls to synchronization on all kernel entry
procedures, in places where it is guaranteed to not normally yet have
done any calls to parts of the kernel that may be affected by "stale"
state, and do it in a manner as consistent and generalized as possible.

The current state of kernel entry handling on arm and arm64
architectures has significant differences from x86 and from each other.
There is also a matter of interrupt controllers. As can be seen in
interrupt controller-specific patch, I had to accommodate some variety
of custom interrupt entry code. What can not be seen, is that I had to
check that all other interrupt controller drivers and architecture-
specific entry procedures, and find that they _do_ follow some
understandable rules -- unfortunately architecture-specific and not
documented in any manner.

I have no valid reasons for complaining about it. I could not expect
that authors of all kernel entry procedures would have any
foreknowledge that someone at some point may have a reason to establish
any kind of synchronization point for CPU cores. And this is why I had
to do my research by manually drawing call trees and sequences,
separately for every entry on every supported architecture, and across
two or three versions of kernel, as those were changing along the way.

The result of this may be not a "design" per se, but an understanding
of how things are implemented, and what rules are being followed, so I
could add my code in a manner consistent with what is done, and
document the whole thing. Then there will be some written rules to
check for, when anything of this kind will be necessary again (say,
with TLB, but considering how much now is done in userspace, possibly
to accommodate more exotic CPU features that may have state messed up
by userspace). I am afraid, this task, kernel entry documentation,
would take me some time, and I did not want to delay my task isolation
patch for this reason.

As I followed whatever rules I have determined to be applicable, I have
produced code that introduces hooks in multiple seemingly unrelated to
each other places. Whenever there was a, forgive me the pun, exception
to those rules, another special case had to be handled.

So no, I did not just add entry hooks randomly, and your accusations of
having no design are unwarranted. My patches reflect what is already in
code and in its design, I have added one simple rule that entry hook
runs at the point when no dependency on something that requires
synchronization, exists yet. The entry hook is small, you have already
seen all of it while listing things that are not compatible with
noinst. Its mechanism and purpose are explained in general description
of task isolation. I don't think, I can provide a better explanation.

I have to apologize for not taking into account all your carefully
built instrumentation safety support. That was one thing I have missed.
However at this point the only way for me to find out that I am wrong
about it, and my code does not comply with expectations defined by
advanced state of x86 architecture development, was to present whatever
I could do right, based on experience with other platforms. I don't
think, this warrants such hostility.

Another issue that you have asked me to defend is the existence and
scope of task isolation itself. I have provided long explanation in
changelog and previous discussions of this patch, and before me so did
Chris Metcalf and occasionally Yuri Norov. While I understand that this
is an unusual feature and by its nature it affects kernel in multiple
places, it does not deserve to be called a "mess" and other synonyms of
"mess". It's an attempt to introduce a feature that turns Linux
userspace into superior replacement of RTOS. Considering current state
of CPU and SoC development, it is becoming very difficult even for
vendor-specific RTOS to keep up with advanced hardware features. Linux
keeps up with them just fine, however it lacks the ability to truly
leave the CPU core alone, to run the performance-critical and latency-
critical part of a task in a manner that RTOS user would expect. Very
close but not yet. Task isolation provides the last step for this RTOS
replacement. It is implemented in a manner that allows the user to
combine Linux resource handling and initialization with RTOS isolation
and latency. The decision about page faults is a part of this design,
as well as many other decisions implemented in this code. Many may
disagree with either those decisions, or the validity of a goal, some
may even argue that it's a bad thing to provide a reason to stop RTOS
development (I think, this is a good thing but that's not the point).

However most definitely this is not a "mess", and it I do not believe
that I have to defend the validity of this direction of development, or
be accused of general incompetence every time someone finds a
frustrating mistake in my code. As I said, I am trying to do the right
thing, and want to bring my code not only to the state where x86
support is on par with other platforms (that is, working when
instrumentation is disabled), but also make it fully compliant with
current requirements of x86 platform.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/13] "Task_isolation" mode
  2020-07-24  3:00           ` [EXT] " Alex Belits
@ 2020-07-24 16:08             ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2020-07-24 16:08 UTC (permalink / raw)
  To: Alex Belits, peterz
  Cc: mingo, Prasun Kapoor, linux-api, rostedt, frederic,
	catalin.marinas, will, linux-arch, davem, linux-kernel,
	linux-arm-kernel, netdev

Alex,

Alex Belits <abelits@marvell.com> writes:
> On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
>>  1) That inline function can be put out of line by the compiler and
>>     placed into the regular text section which makes it subject to
>>     instrumentation
>> 
>>  2) That inline function invokes local_irq_save() which is subject to
>>     instrumentation _before_ the entry state for the instrumentation
>>     mechanisms is established.
>> 
>>  3) That inline function invokes sync_core() before important state
>> has
>>     been established, which is especially interesting in NMI like
>>     exceptions.
>> 
>> As you clearly documented why all of the above is safe and does not
>> cause any problems, it's just me and Peter being silly, right?
>> 
>> Try again.
>
> I don't think, accusations and mockery are really necessary here.

Let's get some context to this.

  I told you in my first mail, that this breaks noinstr and that
  building with full debug would have told you.

  Peter gave you a clear hint where to look.

Now it might be expected that you investigate that or at least ask
questions before making the bold claim:

>> > Unless something else is involved, those operations are safe, so I
>> > am not adding anything that can break those.

Surely I could have avoided the snide remark, but after you demonstrably
ignored technically valid concerns and suggestions in your other reply,
I was surely not in the mood to be overly careful in my choice of words.

> The result of this may be not a "design" per se, but an understanding
> of how things are implemented, and what rules are being followed, so I
> could add my code in a manner consistent with what is done, and
> document the whole thing.

Every other big and technically complex project which has to change the
very inner workings of the kernel started the same way. I'm not aware of
any of them getting accepted as is or in a big code dump.

What you have now qualifies as proof of concept and the big challenge is
to turn it into something which is acceptable and maintainable.

You talk in great length about how inconsistent stuff is all over the
place. Yes, it is indeed. You even call that inconsistency an existing
design:

> My patches reflect what is already in code and in its design.

I agree that you just work with the code as is, but you might have
noticed that quite some of this stuff is clearly not designed at all or
designed badly.

The solution is not to pile on top of the inconsistency, the solution is
to make it consistent in the first place.

You are surely going to say, that's beyond the scope of your project. I
can tell you that it is in the scope of your project simply because just
proliferating the status quo and piling new stuff on top is not an
option. And no, there are no people waiting in a row to mop up after
you either.

Quite some of the big technology projects have spent and still spend
considerable amount of time to do exactly this kind of consolidation
work upfront in order to make their features acceptable in a
maintainable form.

All of these projects have been merged or are still being merged
piecewise in reviewable chunks.

We are talking about intrusive technology which requires a very careful
integration to prevent it from becoming a roadblock or a maintenaince
headache. The approach and implementation has to be _agreed_ on by the
involved parties, i.e. submitters, reviewers and maintainers.

> While I understand that this is an unusual feature and by its nature
> it affects kernel in multiple places, it does not deserve to be called
> a "mess" and other synonyms of "mess".

The feature is perfectly fine and I completely understand why you want
it. Guess who started to lay the grounds for NOHZ_FULL more than a
decade ago and why?

The implementation is not acceptable on technical grounds,
maintainability reasons, lack of design and proper consolidated
integration.

> Another issue that you have asked me to defend is the existence and
> scope of task isolation itself.

I have not asked you to defend the existance. I asked you for coherent
explanations how the implementation works and why the chosen approach is
correct and valid. That's a completely different thing.

> It's an attempt to introduce a feature that turns Linux userspace into
> superior replacement of RTOS.....

Can you please spare me the advertising and marketing? I'm very well
aware what an RTOS is and I'm also very well aware that there is no such
thing like a 'superior replacement' for RTOS in general.

If your view of RTOS is limited to this particular feature, then I have
to tell you that this particular feature is only useful for a very small
portion of the overall RTOS use cases.

> However most definitely this is not a "mess", and it I do not believe
> that I have to defend the validity of this direction of development, or
> be accused of general incompetence every time someone finds a
> frustrating mistake in my code.

Nobody accuses you of incompetence, but you will have to defend the
validity of your approach and implementation and accept that things
might not be as shiny as you think they are. That's not hostility,
that's just how Linux kernel development works whether you like it or
not. 

I surely can understand your frustration over my view of this series,
but you might have noticed that aside of criticism I gave you very clear
technical arguments and suggestions how to proceed.

It's your decision what you make of that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-07-22 14:49 ` [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-10-01 13:56   ` Frederic Weisbecker
  2020-10-04 14:44     ` [EXT] " Alex Belits
  2020-10-01 14:40   ` Frederic Weisbecker
  1 sibling, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-01 13:56 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, Prasun Kapoor, mingo, davem, linux-api, peterz,
	linux-arch, catalin.marinas, tglx, will, linux-arm-kernel,
	linux-kernel, netdev

On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> +/*
> + * Description of the last two tasks that ran isolated on a given CPU.
> + * This is intended only for messages about isolation breaking. We
> + * don't want any references to actual task while accessing this from
> + * CPU that caused isolation breaking -- we know nothing about timing
> + * and don't want to use locking or RCU.
> + */
> +struct isol_task_desc {
> +	atomic_t curr_index;
> +	atomic_t curr_index_wr;
> +	bool	warned[2];
> +	pid_t	pid[2];
> +	pid_t	tgid[2];
> +	char	comm[2][TASK_COMM_LEN];
> +};
> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);

So that's quite a huge patch that would have needed to be split up.
Especially this tracing engine.

Speaking of which, I agree with Thomas that it's unnecessary. It's too much
code and complexity. We can use the existing trace events and perform the
analysis from userspace to find the source of the disturbance.

Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-07-22 14:49 ` [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel Alex Belits
  2020-10-01 13:56   ` Frederic Weisbecker
@ 2020-10-01 14:40   ` Frederic Weisbecker
  2020-10-04 15:01     ` [EXT] " Alex Belits
  1 sibling, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-01 14:40 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, Prasun Kapoor, mingo, davem, linux-api, peterz,
	linux-arch, catalin.marinas, tglx, will, linux-arm-kernel,
	linux-kernel, netdev

On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> +/**
> + * task_isolation_kernel_enter() - clear low-level task isolation flag
> + *
> + * This should be called immediately after entering kernel.
> + */
> +static inline void task_isolation_kernel_enter(void)
> +{
> +	unsigned long flags;
> +
> +	/*
> +	 * This function runs on a CPU that ran isolated task.
> +	 *
> +	 * We don't want this CPU running code from the rest of kernel
> +	 * until other CPUs know that it is no longer isolated.
> +	 * When CPU is running isolated task until this point anything
> +	 * that causes an interrupt on this CPU must end up calling this
> +	 * before touching the rest of kernel. That is, this function or
> +	 * fast_task_isolation_cpu_cleanup() or stop_isolation() calling
> +	 * it. If any interrupt, including scheduling timer, arrives, it
> +	 * will still end up here early after entering kernel.
> +	 * From this point interrupts are disabled until all CPUs will see
> +	 * that this CPU is no longer running isolated task.
> +	 *
> +	 * See also fast_task_isolation_cpu_cleanup().
> +	 */
> +	smp_rmb();

I'm a bit confused what this read memory barrier is ordering. Also against
what it pairs.

> +	if((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0)
> +		return;
> +
> +	local_irq_save(flags);
> +
> +	/* Clear low-level flags */
> +	this_cpu_write(ll_isol_flags, 0);
> +
> +	/*
> +	 * If something happened that requires a barrier that would
> +	 * otherwise be called from remote CPUs by CPU kick procedure,
> +	 * this barrier runs instead of it. After this barrier, CPU
> +	 * kick procedure would see the updated ll_isol_flags, so it
> +	 * will run its own IPI to trigger a barrier.
> +	 */
> +	smp_mb();
> +	/*
> +	 * Synchronize instructions -- this CPU was not kicked while
> +	 * in isolated mode, so it might require synchronization.
> +	 * There might be an IPI if kick procedure happened and
> +	 * ll_isol_flags was already updated while it assembled a CPU
> +	 * mask. However if this did not happen, synchronize everything
> +	 * here.
> +	 */
> +	instr_sync();

It's the first time I meet an instruction barrier. I should get information
about that but what is it ordering here?

> +	local_irq_restore(flags);
> +}

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-07-22 14:57 ` [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-10-01 14:44   ` Frederic Weisbecker
  2020-10-04 15:22     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-01 14:44 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, Prasun Kapoor, mingo, davem, linux-api, peterz,
	linux-arch, catalin.marinas, tglx, will, linux-arm-kernel,
	linux-kernel, netdev

On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds check for it.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 6e4cd8459f05..2f82a6daf8fc 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/stat.h>
>  #include <linux/sched/nohz.h>
> +#include <linux/isolation.h>
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

What is it ordering?

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

You can't simply ignore an IPI. There is always a reason for a nohz_full CPU
to be kicked. Something triggered a tick dependency. It can be posix cpu timers
for example, or anything.


>  
>  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-07-22 14:58 ` [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
@ 2020-10-01 14:47   ` Frederic Weisbecker
  2020-10-04 17:12     ` [EXT] " Alex Belits
  2021-01-22 14:13     ` Marcelo Tosatti
  0 siblings, 2 replies; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-01 14:47 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, Prasun Kapoor, mingo, davem, linux-api, peterz,
	linux-arch, catalin.marinas, tglx, will, linux-arm-kernel,
	linux-kernel, netdev

On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> If CPU runs isolated task, there's no any backlog on it, and
> so we don't need to flush it.

What guarantees that we have no backlog on it?

> Currently flush_all_backlogs()
> enqueues corresponding work on all CPUs including ones that run
> isolated tasks. It leads to breaking task isolation for nothing.
> 
> In this patch, backlog flushing is enqueued only on non-isolated CPUs.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: use safe task_isolation_on_cpu() implementation]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  net/core/dev.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 90b59fc50dc9..83a282f7453d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -74,6 +74,7 @@
>  #include <linux/cpu.h>
>  #include <linux/types.h>
>  #include <linux/kernel.h>
> +#include <linux/isolation.h>
>  #include <linux/hash.h>
>  #include <linux/slab.h>
>  #include <linux/sched.h>
> @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
>  
>  	get_online_cpus();
>  
> -	for_each_online_cpu(cpu)
> +	smp_rmb();

What is it ordering?

> +	for_each_online_cpu(cpu) {
> +		if (task_isolation_on_cpu(cpu))
> +			continue;
>  		queue_work_on(cpu, system_highpri_wq,
>  			      per_cpu_ptr(&flush_works, cpu));
> +	}
>  
>  	for_each_online_cpu(cpu)
>  		flush_work(per_cpu_ptr(&flush_works, cpu));

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-01 13:56   ` Frederic Weisbecker
@ 2020-10-04 14:44     ` Alex Belits
  2020-10-04 23:14       ` Frederic Weisbecker
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-10-04 14:44 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel

On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > +/*
> > + * Description of the last two tasks that ran isolated on a given
> > CPU.
> > + * This is intended only for messages about isolation breaking. We
> > + * don't want any references to actual task while accessing this
> > from
> > + * CPU that caused isolation breaking -- we know nothing about
> > timing
> > + * and don't want to use locking or RCU.
> > + */
> > +struct isol_task_desc {
> > +	atomic_t curr_index;
> > +	atomic_t curr_index_wr;
> > +	bool	warned[2];
> > +	pid_t	pid[2];
> > +	pid_t	tgid[2];
> > +	char	comm[2][TASK_COMM_LEN];
> > +};
> > +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> 
> So that's quite a huge patch that would have needed to be split up.
> Especially this tracing engine.
> 
> Speaking of which, I agree with Thomas that it's unnecessary. It's
> too much
> code and complexity. We can use the existing trace events and perform
> the
> analysis from userspace to find the source of the disturbance.

The idea behind this is that isolation breaking events are supposed to
be known to the applications while applications run normally, and they
should not require any analysis or human intervention to be handled.

A process may exit isolation because some leftover delayed work, for
example, a timer or a workqueue, is still present on a CPU, or because
a page fault or some other exception, normally handled silently, is
caused by the task. It is also possible to direct an interrupt to a CPU
that is running an isolated task -- currently it's perfectly valid to
set interrupt smp affinity to a CPU running isolated task, and then
interrupt will cause breaking isolation. While it's probably not the
best way of handling interrupts, I would rather not prohibit this
explicitly.

There is also a matter of avoiding race conditions on entering
isolation. Once CPU entered isolation, other CPUs should avoid
disturbing it when they know that CPU is running a task in isolated
mode. However for a short time after entering isolation other CPUs may
be unaware of this, and will still send IPIs to it. Preventing this
scenario completely would be very costly in terms of what other CPUs
will have to do before notifying others, so similar to how EINTR works,
we can simply specify that this is allowed, and task is supposed to re-
enter isolation after this. It's still a bad idea to specify that
isolation breaking can continue happening while application is running
in isolated mode, however allowing some "grace period" after entering
is acceptable as long as application is aware of this happening.

In libtmc I have moved this handling of isolation breaking into a
separate thread, intended to become a separate daemon if necessary. In
part it was done because initial implementation of isolation made it
very difficult to avoid repeating delayed work on isolated CPUs, so
something had to watch for it from non-isolated CPU. It's possible that
now, when delayed work does not appear on isolated CPUs out of nowhere,
the need in isolation manager thread will disappear, and task itself
will be able to handle all isolation breaking, like original
implementation by Chris was supposed to.

However in either case it's still useful for the task, or isolation
manager, to get a description of the isolation-breaking event. This is
what those things are intended for. Now they only produce log messages
because this is where initially all description of isolation-breaking
events went, however I would prefer to make logging optional but always
let applications read those events descriptions, regardless of any
tracing mechanism being used. I was more focused on making the
reporting mechanism properly detect the cause of isolation breaking
because that functionality was not quite working in earlier work by
Chris and Yuri, so I have kept logging as the only output, but made it
suitable for producing events that applications will be able to
receive. Application, or isolation manager, will receive clear and
unambiguous reporting, so there will be no need for any additional
analysis or guesswork.

After adding a proper "low-level" isolation flags, I got the idea that
we might have a better yet reporting mechanism. Early isolation
breaking detection on kernel entry may set a flag that says that
isolation breaking happened, however its cause is unknown. Or, more
likely, only some general information about isolation breaking is
available, like a type of exception. Then, once a known isolation-
breaking reporting mechanism is called from interrupt, syscall, IPI or
exception processing, the flag is cleared, and reporting is supposed to
be done. However if then kernel returns to userspace on isolated task
but isolation breaking is not reported yet, an isolation breaking
reporting with "unknown cause" will happen. We may even add some more
optional lightweight tracing for debugging purposes, however the fact
that reporting will be done, will allow us to make sure that no matter
how complicated exception processing is, or how we managed to miss some
subtle details of an architecture where we are implementing task
isolation, there will be a reliable way to tell the user that something
is wrong, and tell the task that there is something it has to react to.


> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-01 14:40   ` Frederic Weisbecker
@ 2020-10-04 15:01     ` Alex Belits
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-10-04 15:01 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel


On Thu, 2020-10-01 at 16:40 +0200, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > +/**
> > + * task_isolation_kernel_enter() - clear low-level task isolation
> > flag
> > + *
> > + * This should be called immediately after entering kernel.
> > + */
> > +static inline void task_isolation_kernel_enter(void)
> > +{
> > +	unsigned long flags;
> > +
> > +	/*
> > +	 * This function runs on a CPU that ran isolated task.
> > +	 *
> > +	 * We don't want this CPU running code from the rest of kernel
> > +	 * until other CPUs know that it is no longer isolated.
> > +	 * When CPU is running isolated task until this point anything
> > +	 * that causes an interrupt on this CPU must end up calling
> > this
> > +	 * before touching the rest of kernel. That is, this function
> > or
> > +	 * fast_task_isolation_cpu_cleanup() or stop_isolation()
> > calling
> > +	 * it. If any interrupt, including scheduling timer, arrives,
> > it
> > +	 * will still end up here early after entering kernel.
> > +	 * From this point interrupts are disabled until all CPUs will
> > see
> > +	 * that this CPU is no longer running isolated task.
> > +	 *
> > +	 * See also fast_task_isolation_cpu_cleanup().
> > +	 */
> > +	smp_rmb();
> 
> I'm a bit confused what this read memory barrier is ordering. Also
> against
> what it pairs.

My bad, I have kept it after there were left no write accesses from
other CPUs.

> 
> > +	if((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) ==
> > 0)
> > +		return;
> > +
> > +	local_irq_save(flags);
> > +
> > +	/* Clear low-level flags */
> > +	this_cpu_write(ll_isol_flags, 0);
> > +
> > +	/*
> > +	 * If something happened that requires a barrier that would
> > +	 * otherwise be called from remote CPUs by CPU kick procedure,
> > +	 * this barrier runs instead of it. After this barrier, CPU
> > +	 * kick procedure would see the updated ll_isol_flags, so it
> > +	 * will run its own IPI to trigger a barrier.
> > +	 */
> > +	smp_mb();
> > +	/*
> > +	 * Synchronize instructions -- this CPU was not kicked while
> > +	 * in isolated mode, so it might require synchronization.
> > +	 * There might be an IPI if kick procedure happened and
> > +	 * ll_isol_flags was already updated while it assembled a CPU
> > +	 * mask. However if this did not happen, synchronize everything
> > +	 * here.
> > +	 */
> > +	instr_sync();
> 
> It's the first time I meet an instruction barrier. I should get
> information
> about that but what is it ordering here?

Against barriers in instruction cache flushing (flush_icache_range()
and such). 

> > +	local_irq_restore(flags);
> > +}
> 
> Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-10-01 14:44   ` Frederic Weisbecker
@ 2020-10-04 15:22     ` Alex Belits
  2020-10-06 21:41       ` Frederic Weisbecker
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-10-04 15:22 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel


On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 6e4cd8459f05..2f82a6daf8fc 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/stat.h>
> >  #include <linux/sched/nohz.h>
> > +#include <linux/isolation.h>
> >  #include <linux/module.h>
> >  #include <linux/irq_work.h>
> >  #include <linux/posix-timers.h>
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
> 
> What is it ordering?

ll_isol_flags will be read in task_isolation_on_cpu(), that accrss
should be ordered against writing in
task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
and task_isolation_start().

Since task_isolation_on_cpu() is often called for multiple CPUs in a
sequence, it would be wasteful to include a barrier inside it.

> > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> >  		return;
> 
> You can't simply ignore an IPI. There is always a reason for a
> nohz_full CPU
> to be kicked. Something triggered a tick dependency. It can be posix
> cpu timers
> for example, or anything.

I realize that this is unusual, however the idea is that while the task
is running in isolated mode in userspace, we assume that from this CPUs
point of view whatever is happening in kernel, can wait until CPU is
back in kernel, and when it first enters kernel from this mode, it
should "catch up" with everything that happened in its absence.
task_isolation_kernel_enter() is supposed to do that, so by the time
anything should be done involving the rest of the kernel, CPU is back
to normal.

It is application's responsibility to avoid triggering things that
break its isolation, so the application assumes that everything that
involves entering kernel will not be available while it is isolated. If
isolation will be broken, or application will request return from
isolation, everything will go back to normal environment with all
functionality available.

> >  
> >  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> > -- 
> > 2.26.2
> > 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-10-01 14:47   ` Frederic Weisbecker
@ 2020-10-04 17:12     ` Alex Belits
  2021-01-22 14:13     ` Marcelo Tosatti
  1 sibling, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-10-04 17:12 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel


On Thu, 2020-10-01 at 16:47 +0200, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > If CPU runs isolated task, there's no any backlog on it, and
> > so we don't need to flush it.
> 
> What guarantees that we have no backlog on it?

I believe, the logic was that it is not supposed to have backlog
because it could not be produced while the CPU was in userspace,
because one has to enter kernel to receive (by interrupt) or send (by
syscall) anything.

Now, looking at this patch. I don't think, it can be guaranteed that
there was no backlog before it entered userspace. Then backlog
processing will be delayed until exit from isolation. It won't be
queued, and flush_work() will not wait when no worker is assigned, so
there won't be a deadlock, however this delay may not be such a great
idea.

So it may be better to flush backlog before entering isolation, and in
flush_all_backlogs() instead of skipping all CPUs in isolated mode,
check if their per-CPU softnet_data->input_pkt_queue and softnet_data-
>process_queue are empty, and if they are not, call backlog anyway.
Then, if for whatever reason backlog will appear after flushing (we
can't guarantee that nothing preempted us then), it will cause one
isolation breaking event, and if nothing will be queued before re-
entering isolation, there will be no backlog until exiting isolation.

> 
> > Currently flush_all_backlogs()
> > enqueues corresponding work on all CPUs including ones that run
> > isolated tasks. It leads to breaking task isolation for nothing.
> > 
> > In this patch, backlog flushing is enqueued only on non-isolated
> > CPUs.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: use safe task_isolation_on_cpu()
> > implementation]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  net/core/dev.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 90b59fc50dc9..83a282f7453d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -74,6 +74,7 @@
> >  #include <linux/cpu.h>
> >  #include <linux/types.h>
> >  #include <linux/kernel.h>
> > +#include <linux/isolation.h>
> >  #include <linux/hash.h>
> >  #include <linux/slab.h>
> >  #include <linux/sched.h>
> > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
> >  
> >  	get_online_cpus();
> >  
> > -	for_each_online_cpu(cpu)
> > +	smp_rmb();
> 
> What is it ordering?

Same as with other calls to task_isolation_on_cpu(cpu), it orders
access to ll_isol_flags.

> > +	for_each_online_cpu(cpu) {
> > +		if (task_isolation_on_cpu(cpu))
> > +			continue;
> >  		queue_work_on(cpu, system_highpri_wq,
> >  			      per_cpu_ptr(&flush_works, cpu));
> > +	}
> >  
> >  	for_each_online_cpu(cpu)
> >  		flush_work(per_cpu_ptr(&flush_works, cpu));
> 
> Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-04 14:44     ` [EXT] " Alex Belits
@ 2020-10-04 23:14       ` Frederic Weisbecker
  2020-10-05 18:52         ` Nitesh Narayan Lal
  2020-10-06 11:01         ` Alex Belits
  0 siblings, 2 replies; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-04 23:14 UTC (permalink / raw)
  To: Alex Belits
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel

On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> > External Email
> > 
> > -------------------------------------------------------------------
> > ---
> > On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > > +/*
> > > + * Description of the last two tasks that ran isolated on a given
> > > CPU.
> > > + * This is intended only for messages about isolation breaking. We
> > > + * don't want any references to actual task while accessing this
> > > from
> > > + * CPU that caused isolation breaking -- we know nothing about
> > > timing
> > > + * and don't want to use locking or RCU.
> > > + */
> > > +struct isol_task_desc {
> > > +	atomic_t curr_index;
> > > +	atomic_t curr_index_wr;
> > > +	bool	warned[2];
> > > +	pid_t	pid[2];
> > > +	pid_t	tgid[2];
> > > +	char	comm[2][TASK_COMM_LEN];
> > > +};
> > > +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> > 
> > So that's quite a huge patch that would have needed to be split up.
> > Especially this tracing engine.
> > 
> > Speaking of which, I agree with Thomas that it's unnecessary. It's
> > too much
> > code and complexity. We can use the existing trace events and perform
> > the
> > analysis from userspace to find the source of the disturbance.
> 
> The idea behind this is that isolation breaking events are supposed to
> be known to the applications while applications run normally, and they
> should not require any analysis or human intervention to be handled.

Sure but you can use trace events for that. Just trace interrupts, workqueues,
timers, syscalls, exceptions and scheduler events and you get all the local
disturbance. You might want to tune a few filters but that's pretty much it.

As for the source of the disturbances, if you really need that information,
you can trace the workqueue and timer queue events and just filter those that
target your isolated CPUs.

> 
> A process may exit isolation because some leftover delayed work, for
> example, a timer or a workqueue, is still present on a CPU, or because
> a page fault or some other exception, normally handled silently, is
> caused by the task. It is also possible to direct an interrupt to a CPU
> that is running an isolated task -- currently it's perfectly valid to
> set interrupt smp affinity to a CPU running isolated task, and then
> interrupt will cause breaking isolation. While it's probably not the
> best way of handling interrupts, I would rather not prohibit this
> explicitly.

Sure, but you can trace all these events with the existing tracing
interface we have.

> 
> There is also a matter of avoiding race conditions on entering
> isolation. Once CPU entered isolation, other CPUs should avoid
> disturbing it when they know that CPU is running a task in isolated
> mode. However for a short time after entering isolation other CPUs may
> be unaware of this, and will still send IPIs to it. Preventing this
> scenario completely would be very costly in terms of what other CPUs
> will have to do before notifying others, so similar to how EINTR works,
> we can simply specify that this is allowed, and task is supposed to re-
> enter isolation after this. It's still a bad idea to specify that
> isolation breaking can continue happening while application is running
> in isolated mode, however allowing some "grace period" after entering
> is acceptable as long as application is aware of this happening.

Right but that doesn't look related to tracing. Anyway I guess we
can make the CPU enter some specific mode after calling synchronize_rcu().

> 
> In libtmc I have moved this handling of isolation breaking into a
> separate thread, intended to become a separate daemon if necessary. In
> part it was done because initial implementation of isolation made it
> very difficult to avoid repeating delayed work on isolated CPUs, so
> something had to watch for it from non-isolated CPU. It's possible that
> now, when delayed work does not appear on isolated CPUs out of nowhere,
> the need in isolation manager thread will disappear, and task itself
> will be able to handle all isolation breaking, like original
> implementation by Chris was supposed to.
> 
> However in either case it's still useful for the task, or isolation
> manager, to get a description of the isolation-breaking event. This is
> what those things are intended for. Now they only produce log messages
> because this is where initially all description of isolation-breaking
> events went, however I would prefer to make logging optional but always
> let applications read those events descriptions, regardless of any
> tracing mechanism being used. I was more focused on making the
> reporting mechanism properly detect the cause of isolation breaking
> because that functionality was not quite working in earlier work by
> Chris and Yuri, so I have kept logging as the only output, but made it
> suitable for producing events that applications will be able to
> receive. Application, or isolation manager, will receive clear and
> unambiguous reporting, so there will be no need for any additional
> analysis or guesswork.

That still look like a job for userspace, based on trace events.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-04 23:14       ` Frederic Weisbecker
@ 2020-10-05 18:52         ` Nitesh Narayan Lal
  2020-10-06 10:35           ` Frederic Weisbecker
  2020-10-17  1:08           ` Alex Belits
  2020-10-06 11:01         ` Alex Belits
  1 sibling, 2 replies; 49+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-05 18:52 UTC (permalink / raw)
  To: Frederic Weisbecker, Alex Belits
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel


[-- Attachment #1.1: Type: text/plain, Size: 2335 bytes --]


On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
>> On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
>>> External Email
>>>
>>> -------------------------------------------------------------------
>>> ---
>>> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
>>>> +/*
>>>> + * Description of the last two tasks that ran isolated on a given
>>>> CPU.
>>>> + * This is intended only for messages about isolation breaking. We
>>>> + * don't want any references to actual task while accessing this
>>>> from
>>>> + * CPU that caused isolation breaking -- we know nothing about
>>>> timing
>>>> + * and don't want to use locking or RCU.
>>>> + */
>>>> +struct isol_task_desc {
>>>> +	atomic_t curr_index;
>>>> +	atomic_t curr_index_wr;
>>>> +	bool	warned[2];
>>>> +	pid_t	pid[2];
>>>> +	pid_t	tgid[2];
>>>> +	char	comm[2][TASK_COMM_LEN];
>>>> +};
>>>> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
>>> So that's quite a huge patch that would have needed to be split up.
>>> Especially this tracing engine.
>>>
>>> Speaking of which, I agree with Thomas that it's unnecessary. It's
>>> too much
>>> code and complexity. We can use the existing trace events and perform
>>> the
>>> analysis from userspace to find the source of the disturbance.
>> The idea behind this is that isolation breaking events are supposed to
>> be known to the applications while applications run normally, and they
>> should not require any analysis or human intervention to be handled.
> Sure but you can use trace events for that. Just trace interrupts, workqueues,
> timers, syscalls, exceptions and scheduler events and you get all the local
> disturbance. You might want to tune a few filters but that's pretty much it.
>
> As for the source of the disturbances, if you really need that information,
> you can trace the workqueue and timer queue events and just filter those that
> target your isolated CPUs.
>

I agree that we can do all those things with tracing.
However, IMHO having a simplified logging mechanism to gather the source of
violation may help in reducing the manual effort.

Although, I am not sure how easy will it be to maintain such an interface
over time.

--
Thanks
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-05 18:52         ` Nitesh Narayan Lal
@ 2020-10-06 10:35           ` Frederic Weisbecker
  2020-10-17  1:13             ` Alex Belits
  2020-10-17  1:08           ` Alex Belits
  1 sibling, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-06 10:35 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Alex Belits, mingo, davem, linux-api, rostedt, peterz,
	linux-arch, catalin.marinas, tglx, will, Prasun Kapoor,
	linux-kernel, netdev, linux-arm-kernel

On Mon, Oct 05, 2020 at 02:52:49PM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> >> On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> >>> External Email
> >>>
> >>> -------------------------------------------------------------------
> >>> ---
> >>> On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> >>>> +/*
> >>>> + * Description of the last two tasks that ran isolated on a given
> >>>> CPU.
> >>>> + * This is intended only for messages about isolation breaking. We
> >>>> + * don't want any references to actual task while accessing this
> >>>> from
> >>>> + * CPU that caused isolation breaking -- we know nothing about
> >>>> timing
> >>>> + * and don't want to use locking or RCU.
> >>>> + */
> >>>> +struct isol_task_desc {
> >>>> +	atomic_t curr_index;
> >>>> +	atomic_t curr_index_wr;
> >>>> +	bool	warned[2];
> >>>> +	pid_t	pid[2];
> >>>> +	pid_t	tgid[2];
> >>>> +	char	comm[2][TASK_COMM_LEN];
> >>>> +};
> >>>> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> >>> So that's quite a huge patch that would have needed to be split up.
> >>> Especially this tracing engine.
> >>>
> >>> Speaking of which, I agree with Thomas that it's unnecessary. It's
> >>> too much
> >>> code and complexity. We can use the existing trace events and perform
> >>> the
> >>> analysis from userspace to find the source of the disturbance.
> >> The idea behind this is that isolation breaking events are supposed to
> >> be known to the applications while applications run normally, and they
> >> should not require any analysis or human intervention to be handled.
> > Sure but you can use trace events for that. Just trace interrupts, workqueues,
> > timers, syscalls, exceptions and scheduler events and you get all the local
> > disturbance. You might want to tune a few filters but that's pretty much it.
> >
> > As for the source of the disturbances, if you really need that information,
> > you can trace the workqueue and timer queue events and just filter those that
> > target your isolated CPUs.
> >
> 
> I agree that we can do all those things with tracing.
> However, IMHO having a simplified logging mechanism to gather the source of
> violation may help in reducing the manual effort.
> 
> Although, I am not sure how easy will it be to maintain such an interface
> over time.

The thing is: tracing is your simplified logging mechanism here. You can achieve
the same in userspace with _way_ less code, no race, and you can do it in
bash.

Thanks.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-04 23:14       ` Frederic Weisbecker
  2020-10-05 18:52         ` Nitesh Narayan Lal
@ 2020-10-06 11:01         ` Alex Belits
  1 sibling, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-10-06 11:01 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch, tglx,
	catalin.marinas, will, Prasun Kapoor, linux-kernel,
	linux-arm-kernel, netdev

On Mon, 2020-10-05 at 01:14 +0200, Frederic Weisbecker wrote:
> Speaking of which, I agree with Thomas that it's unnecessary.
> > > It's
> > > too much
> > > code and complexity. We can use the existing trace events and
> > > perform
> > > the
> > > analysis from userspace to find the source of the disturbance.
> > 
> > The idea behind this is that isolation breaking events are supposed
> > to
> > be known to the applications while applications run normally, and
> > they
> > should not require any analysis or human intervention to be
> > handled.
> 
> Sure but you can use trace events for that. Just trace interrupts,
> workqueues,
> timers, syscalls, exceptions and scheduler events and you get all the
> local
> disturbance. You might want to tune a few filters but that's pretty
> much it.

And keep all tracing enabled all the time, just to be able to figure
out that disturbance happened at all?

Or do you mean that we can use kernel entry mechanism to reliably
determine that isolation breaking event happened (so the isolation-
breaking procedure can be triggered as early as possible), yet avoid
trying to determine why exactly it happened, and use tracing if we want
to know?

Original patch did the opposite, it triggered any isolation-breaking
procedure only once it was known specifically, what kind of event
happened -- a hardware interrupt, IPI, syscall, page fault, or any
other kind of exception, possibly something architecture-specific.
This, of course, always had a potential problem with coverage -- if
handling of something is missing, isolation breaking is not handled at
all, and there is no obvious way of finding if we covered everything.
This also made the patch large and somewhat ugly.

When I have added a mechanism for low-level isolation breaking handling
on kernel entry, it also partially improved the problem with
completeness. Partially because I have not yet added handling of
"unknown cause" before returning to userspace, however that would be a
logical thing to do. Then if we entered kernel from isolation, did
something, and are returning to userspace still not knowing what kind
of isolation-breaking event happened, we can still trigger isolation
breaking.

Did I get it right, and you mean that we can remove all specific
handling of isolation breaking causes, except for syscall that exits
isolation, and report isolation breaking instead of normally returning
to userspace? Then isolation breaking will be handled reliably without
knowing the cause, and we can leave determining the cause to the
tracing mechanism (if enabled)?

This does make sense. However for me it looks somewhat strange, because
I assume isolation breaking to be a kind of runtime error, that
userspace software is supposed to get some basic information about --
like, signals distinguishing between, say, SIGSEGV and SIGPIPE, or
write() being able to set errno to ENOSPC or EIO. Then userspace
receives basic information about the cause of exception or error, and
can do some meaningful reporting, or decide if the error should be
fatal for the application or handled differently, based on its internal
logic. To get those distinctions, application does not have to be aware
of anything internal to the kernel.

Similarly distinguishing between, say, a page fault, device interrupt
and a timer may be important for a logic implemented in userspace, and
I think, it may be nice to allow userspace to get this information
immediately and without being aware of any additional details of kernel
implementation. The current patch doesn't do this yet, however the
intention is to implement reliable isolation breaking by checking on
userspace re-entry, plus make reporting of causes, if any were found,
visible to the userspace in some convenient way.

The part that determines the cause can be implemented separately from
isolation breaking mechanism. Then we can have isolation breaking on
kernel entry (or potentially some other condition on kernel entry that
requires logging the cause) enable reporting, then reporting mechanism,
if it exists will fill the blanks, and once either cause is known, or
it's time to return to userspace, notification will be done with
whatever information is available. For some in-depth analysis, if
necessary for debugging the kernel, we can have tracing check if we are
in this "suspicious kernel entry" mode, and log things that otherwise
would not be.

> As for the source of the disturbances, if you really need that
> information,
> you can trace the workqueue and timer queue events and just filter
> those that
> target your isolated CPUs.

For the purpose of human debugging the kernel or application, the more
information is (usually) the better, so the only concern here is that
now user is responsible for completeness of things he is tracing.
However from application's point of view, or for logging in a
production environment it's usually more important to get general type
of events, so it's possible to, say, confirm that nothing "really bad"
happened, or to trigger the emergency response if it did. Say, if the
only causes of isolation breaking was IPI within few moments of
application startup, or signal from somewhere else when application was
restarted, there is no cause for concern. However if hardware
interrupts arrive at random points in time, something is clearly wrong.
And if page faults happen, most likely application forgot to page-in
and lock its address space.

Again, in my opinion this is not unlike reporting ENOSPC vs. EIO while
doing file I/O -- the former (usually) indicates a common problem that
may require application-level cleanup, the latter (also usually) means
that something is seriously wrong.

> > A process may exit isolation because some leftover delayed work,
> > for
> > example, a timer or a workqueue, is still present on a CPU, or
> > because
> > a page fault or some other exception, normally handled silently, is
> > caused by the task. It is also possible to direct an interrupt to a
> > CPU
> > that is running an isolated task -- currently it's perfectly valid
> > to
> > set interrupt smp affinity to a CPU running isolated task, and then
> > interrupt will cause breaking isolation. While it's probably not
> > the
> > best way of handling interrupts, I would rather not prohibit this
> > explicitly.
> 
> Sure, but you can trace all these events with the existing tracing
> interface we have.

Right. However it would require someone to intentionally do tracing of
all those events, all for the purpose of obtaining a type of runtime
error. As an embedded systems developer, who had to look for signs of
unusual bugs on a large number of customers' systems, and had to
distinguish them from reports of hardware malfunctions, I would prefer
something clearly identifiable in the logs (of kernel, application, or
anything else) when no one is specifically investigating any problem.

When anything suspicious happens, often the system is physically
unreachable, and the problem may or may not happen again, so the first
report from a running system may be the only thing available. When
everything is going well, the same systems more often have hardware
failures than report valid software bugs (or, ideally, all reports are
from hardware failures), so it's much better to know that if software
will do something wrong, it would be possible to identify the problem
from the first report, rather than guess.

Sometimes equipment gets firmware updates many years after production,
when there are reports of all kinds of failures due to mechanical or
thermal damage, faulty parts, bad repair work, deteriorating flash,
etc. Among those there might be something that indicates new bugs made
by a new generation of developers (occasionally literally),
regressions, etc. In those situations getting useful information from
the error message in the first report can make a difference between
quickly identifying the problem and going on a wild goose chase.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-10-04 15:22     ` [EXT] " Alex Belits
@ 2020-10-06 21:41       ` Frederic Weisbecker
  2020-10-17  0:17         ` Alex Belits
  0 siblings, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2020-10-06 21:41 UTC (permalink / raw)
  To: Alex Belits
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, tglx, will, Prasun Kapoor, linux-kernel, netdev,
	linux-arm-kernel

On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> 
> On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	smp_rmb();
> > 
> > What is it ordering?
> 
> ll_isol_flags will be read in task_isolation_on_cpu(), that accrss
> should be ordered against writing in
> task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
> and task_isolation_start().
> 
> Since task_isolation_on_cpu() is often called for multiple CPUs in a
> sequence, it would be wasteful to include a barrier inside it.

Then I think you meant a full barrier: smp_mb()

> 
> > > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > >  		return;
> > 
> > You can't simply ignore an IPI. There is always a reason for a
> > nohz_full CPU
> > to be kicked. Something triggered a tick dependency. It can be posix
> > cpu timers
> > for example, or anything.
> 
> I realize that this is unusual, however the idea is that while the task
> is running in isolated mode in userspace, we assume that from this CPUs
> point of view whatever is happening in kernel, can wait until CPU is
> back in kernel and when it first enters kernel from this mode, it
> should "catch up" with everything that happened in its absence.
> task_isolation_kernel_enter() is supposed to do that, so by the time
> anything should be done involving the rest of the kernel, CPU is back
> to normal.

You can't assume that. If something needs the tick, this can't wait.
If the user did something wrong, such as setting a posix cpu timer
to an isolated task, that's his fault and the kernel has to stick with
correctness and kick that task out of isolation mode.

> 
> It is application's responsibility to avoid triggering things that
> break its isolation

Precisely.

> so the application assumes that everything that
> involves entering kernel will not be available while it is isolated.

We can't do things that way and just ignore IPIs. You need to solve the
source of the noise, not the symptoms.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-10-06 21:41       ` Frederic Weisbecker
@ 2020-10-17  0:17         ` Alex Belits
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-10-17  0:17 UTC (permalink / raw)
  To: frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch, tglx,
	catalin.marinas, will, Prasun Kapoor, linux-kernel,
	linux-arm-kernel, netdev


On Tue, 2020-10-06 at 23:41 +0200, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> > On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > > >   */
> > > >  void tick_nohz_full_kick_cpu(int cpu)
> > > >  {
> > > > -	if (!tick_nohz_full_cpu(cpu))
> > > > +	smp_rmb();
> > > 
> > > What is it ordering?
> > 
> > ll_isol_flags will be read in task_isolation_on_cpu(), that accrss
> > should be ordered against writing in
> > task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup()
> > and task_isolation_start().
> > 
> > Since task_isolation_on_cpu() is often called for multiple CPUs in
> > a
> > sequence, it would be wasteful to include a barrier inside it.
> 
> Then I think you meant a full barrier: smp_mb()

For read-only operation? task_isolation_on_cpu() is the only place
where per-cpu ll_isol_flags is accessed, read-only, from multiple CPUs.
All other access to ll_isol_flags is done from the local CPU, and
writes are followed by smp_mb(). There are no other dependencies here,
except operations that depend on the value returned from
task_isolation_on_cpu().

If/when more flags will be added, those rules will be still followed,
because the intention is to store the state of isolation and phases of
entering/breaking/reporting it that can only be updated from the local
CPUs.

> 
> > > > +	if (!tick_nohz_full_cpu(cpu) ||
> > > > task_isolation_on_cpu(cpu))
> > > >  		return;
> > > 
> > > You can't simply ignore an IPI. There is always a reason for a
> > > nohz_full CPU
> > > to be kicked. Something triggered a tick dependency. It can be
> > > posix
> > > cpu timers
> > > for example, or anything.

This was added some time ago, when timers appeared and CPUs were kicked
seemingly out of nowhere. At that point breaking posix timers when
running tasks that are not supposed to rely on posix timers, was the
least problematic solution. From user's point of view in this case
entering isolation had an effect on timer similar to task exiting while
the timer is running.

Right now, there are still sources of superfluous calls to this, when
tick_nohz_full_kick_all() is used. If I will be able to confirm that
this is the only problematic place, I would rather fix calls to it, and
make this condition produce a warning.

This gives me an idea that if there will be a mechanism specifically
for reporting kernel entry and isolation breaking, maybe it should be
possible to add a distinction between:

1. isolation breaking that already happened upon kernel entry;
2. performing operation that will immediately and synchronously cause
isolation breaking;
3. operations or conditions that will eventually or asynchronously
cause isolation breaking (having timers running, possibly sending
signals should be in the same category).

This will be (2).

I assume that when reporting of isolation breaking will be separated
from the isolation implementation, it will be implemented as a runtime
error condition reporting mechanism. Then it can be focused on
providing information about category of events and their sources, and
have internal logic designed for that purpose, as opposed to designed
entirely for debugging, providing flexibility and obtaining maximum
details about internals involved.

> > 
> > I realize that this is unusual, however the idea is that while the
> > task
> > is running in isolated mode in userspace, we assume that from this
> > CPUs
> > point of view whatever is happening in kernel, can wait until CPU
> > is
> > back in kernel and when it first enters kernel from this mode, it
> > should "catch up" with everything that happened in its absence.
> > task_isolation_kernel_enter() is supposed to do that, so by the
> > time
> > anything should be done involving the rest of the kernel, CPU is
> > back
> > to normal.
> 
> You can't assume that. If something needs the tick, this can't wait.
> If the user did something wrong, such as setting a posix cpu timer
> to an isolated task, that's his fault and the kernel has to stick
> with
> correctness and kick that task out of isolation mode.

That would be true if not multiple "let's just tell all other CPUs that
they should check if they have to update something" situations like the
above.

In case of timers it's possible that I will be able to eliminate all
specific instances when this is done, however I think that as a general
approach we have to establish some distinction between things that must
cause IPI (and break isolation) and things that may be delayed until
the isolated userspace task will allow that or some other unavoidable
isolation-breaking event will happen.

> 
> > It is application's responsibility to avoid triggering things that
> > break its isolation
> 
> Precisely.

Right. However there are tings like tick_nohz_full_kick_all() and
similar procedures that result in mass-sending of IPIs without
determining if target CPUs have anything to do with the event at all,
leave alone have to handle it right now, it does not give me an
impression that we can blame application for it. I realize that this is
done for a reason, with the assumption that sending IPIs is "cheaper"
and does not require complex synchronization compared to determining
what and when should be notified, however this is not compatible with
goals of task isolation.

> 
> > so the application assumes that everything that
> > involves entering kernel will not be available while it is
> > isolated.
> 
> We can't do things that way and just ignore IPIs. You need to solve
> the
> source of the noise, not the symptoms.

It may be that eventually we can completely eliminate those things (at
least when isolation is enabled and this is relevant), however for the
purpose of having usable code without massive changes in numerous
callers, in my opinion, we should acknowledge that some things should
be disabled while the task is isolated, and called on isolation exit --
either unconditionally or conditionally if they were requested while
the task was isolated.

I believe that as long as we create a distinction between "must break
isolation", "delayed until the end of isolation" and "can be safely
ignored if the task is isolated" IPIs, we will end up with less
intrusive changes and reliably working functionality.

Then if we will be able to eliminate the sources of things in the last
two categories, we can treat them as if they were in the first one.

It may be that the timers are already ready to this, and I should just
check what causes tick_nohz_full_kick_all() calls. If so, this
particular check won't be necessary because all calls will happen for a
good reason in situations controlled by application. However as a
general approach I think, we need this longer way with decisions about
delaying or ignoring events.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-05 18:52         ` Nitesh Narayan Lal
  2020-10-06 10:35           ` Frederic Weisbecker
@ 2020-10-17  1:08           ` Alex Belits
  2020-10-17 16:08             ` Thomas Gleixner
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-10-17  1:08 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch, tglx,
	catalin.marinas, will, Prasun Kapoor, linux-kernel,
	linux-arm-kernel, netdev


On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote:
> On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> > > On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote:
> > > > External Email
> > > > 
> > > > -------------------------------------------------------------
> > > > ------
> > > > ---
> > > > On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote:
> > > > > +/*
> > > > > + * Description of the last two tasks that ran isolated on a
> > > > > given
> > > > > CPU.
> > > > > + * This is intended only for messages about isolation
> > > > > breaking. We
> > > > > + * don't want any references to actual task while accessing
> > > > > this
> > > > > from
> > > > > + * CPU that caused isolation breaking -- we know nothing
> > > > > about
> > > > > timing
> > > > > + * and don't want to use locking or RCU.
> > > > > + */
> > > > > +struct isol_task_desc {
> > > > > +	atomic_t curr_index;
> > > > > +	atomic_t curr_index_wr;
> > > > > +	bool	warned[2];
> > > > > +	pid_t	pid[2];
> > > > > +	pid_t	tgid[2];
> > > > > +	char	comm[2][TASK_COMM_LEN];
> > > > > +};
> > > > > +static DEFINE_PER_CPU(struct isol_task_desc,
> > > > > isol_task_descs);
> > > > So that's quite a huge patch that would have needed to be split
> > > > up.
> > > > Especially this tracing engine.
> > > > 
> > > > Speaking of which, I agree with Thomas that it's unnecessary.
> > > > It's
> > > > too much
> > > > code and complexity. We can use the existing trace events and
> > > > perform
> > > > the
> > > > analysis from userspace to find the source of the disturbance.
> > > The idea behind this is that isolation breaking events are
> > > supposed to
> > > be known to the applications while applications run normally, and
> > > they
> > > should not require any analysis or human intervention to be
> > > handled.
> > Sure but you can use trace events for that. Just trace interrupts,
> > workqueues,
> > timers, syscalls, exceptions and scheduler events and you get all
> > the local
> > disturbance. You might want to tune a few filters but that's pretty
> > much it.
> > 
> > As for the source of the disturbances, if you really need that
> > information,
> > you can trace the workqueue and timer queue events and just filter
> > those that
> > target your isolated CPUs.
> > 
> 
> I agree that we can do all those things with tracing.
> However, IMHO having a simplified logging mechanism to gather the
> source of
> violation may help in reducing the manual effort.
> 
> Although, I am not sure how easy will it be to maintain such an
> interface
> over time.

I think that the goal of "finding source of disturbance" interface is
different from what can be accomplished by tracing in two ways:

1. "Source of disturbance" should provide some useful information about
category of event and it cause as opposed to determining all precise
details about things being called that resulted or could result in
disturbance. It should not depend on the user's knowledge about details
of implementations, it should provide some definite answer of what
happened (with whatever amount of details can be given in a generic
mechanism) even if the user has no idea how those things happen and
what part of kernel is responsible for either causing or processing
them. Then if the user needs further details, they can be obtained with
tracing.

2. It should be usable as a runtime error handling mechanism, so the
information it provides should be suitable for application use and
logging. It should be usable when applications are running on a system
in production, and no specific tracing or monitoring mechanism can be
in use. If, say, thousands of devices are controlling neutrino
detectors on an ocean floor, and in a month of work one of them got one
isolation breaking event, it should be able to report that isolation
was broken by an interrupt from a network interface, so the users will
be able to track it down to some userspace application reconfiguring
those interrupts.

It will be a good idea to make such mechanism optional and suitable for
tracking things on conditions other than "always enabled" and "enabled
with task isolation". However in my opinion, there should be something
in kernel entry procedure that, if enabled, prepared something to be
filled by the cause data, and we know at least one such situation when
this kernel entry procedure should be triggered -- when task isolation
is on.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-06 10:35           ` Frederic Weisbecker
@ 2020-10-17  1:13             ` Alex Belits
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Belits @ 2020-10-17  1:13 UTC (permalink / raw)
  To: frederic, nitesh
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch, tglx,
	catalin.marinas, will, Prasun Kapoor, linux-kernel,
	linux-arm-kernel, netdev


On Tue, 2020-10-06 at 12:35 +0200, Frederic Weisbecker wrote:
> On Mon, Oct 05, 2020 at 02:52:49PM -0400, Nitesh Narayan Lal wrote:
> > On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote:
> > > 
> > > > The idea behind this is that isolation breaking events are
> > > > supposed to
> > > > be known to the applications while applications run normally,
> > > > and they
> > > > should not require any analysis or human intervention to be
> > > > handled.
> > > Sure but you can use trace events for that. Just trace
> > > interrupts, workqueues,
> > > timers, syscalls, exceptions and scheduler events and you get all
> > > the local
> > > disturbance. You might want to tune a few filters but that's
> > > pretty much it.
> 
> formation,
> > > you can trace the workqueue and timer queue events and just
> > > filter those that
> > > target your isolated CPUs.
> > > 
> > 
> > I agree that we can do all those things with tracing.
> > However, IMHO having a simplified logging mechanism to gather the
> > source of
> > violation may help in reducing the manual effort.
> > 
> > Although, I am not sure how easy will it be to maintain such an
> > interface
> > over time.
> 
> The thing is: tracing is your simplified logging mechanism here. You
> can achieve
> the same in userspace with _way_ less code, no race, and you can do
> it in
> bash.

The idea is that this mechanism should be usable when no one is there
to run things in bash, or no information about what might happen. It
should be able to report rare events in production when users may not
be able to reproduce them.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-17  1:08           ` Alex Belits
@ 2020-10-17 16:08             ` Thomas Gleixner
  2020-10-17 16:15               ` Alex Belits
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2020-10-17 16:08 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, will, Prasun Kapoor, linux-kernel,
	linux-arm-kernel, netdev

On Sat, Oct 17 2020 at 01:08, Alex Belits wrote:
> On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote:
>> On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> I think that the goal of "finding source of disturbance" interface is
> different from what can be accomplished by tracing in two ways:
>
> 1. "Source of disturbance" should provide some useful information about
> category of event and it cause as opposed to determining all precise
> details about things being called that resulted or could result in
> disturbance. It should not depend on the user's knowledge about
> details

Tracepoints already give you selectively useful information.

> of implementations, it should provide some definite answer of what
> happened (with whatever amount of details can be given in a generic
> mechanism) even if the user has no idea how those things happen and
> what part of kernel is responsible for either causing or processing
> them. Then if the user needs further details, they can be obtained with
> tracing.

It's just a matter of defining the tracepoint at the right place.

> 2. It should be usable as a runtime error handling mechanism, so the
> information it provides should be suitable for application use and
> logging. It should be usable when applications are running on a system
> in production, and no specific tracing or monitoring mechanism can be
> in use.

That's a strawman really. There is absolutely no reason why a specific
set of tracepoints cannot be enabled on a production system.

Your tracker is a monitoring mechanism, just a different flavour.  By
your logic above it cannot be enabled on a production system either.

Also you can enable tracepoints from a control application, consume, log
and act upon them. It's not any different from opening some magic
isolation tracker interface. There are even multiple ways to do that
including libraries.

> If, say, thousands of devices are controlling neutrino detectors on an
> ocean floor, and in a month of work one of them got one isolation
> breaking event, it should be able to report that isolation was broken
> by an interrupt from a network interface, so the users will be able to
> track it down to some userspace application reconfiguring those
> interrupts.

Tracing can do that and it can do it selectively on the isolated
CPUs. It's just a matter of proper configuration and usage.

> It will be a good idea to make such mechanism optional and suitable for
> tracking things on conditions other than "always enabled" and "enabled
> with task isolation".

Tracing already provides that. Tracepoints are individually controlled
and filtered.

> However in my opinion, there should be something in kernel entry
> procedure that, if enabled, prepared something to be filled by the
> cause data, and we know at least one such situation when this kernel
> entry procedure should be triggered -- when task isolation is on.

A tracepoint will gather that information for you.

task isolation is not special, it's just yet another way to configure
and use a system and tracepoints provide everything you need with the
bonus that you can gather more correlated information when you need it.

In fact tracing and tracepoints have replaced all specialized trackers
which were in the kernel before tracing was available. We're not going
to add a new one just because.

If there is anything which you find that tracing and tracepoints cannot
provide then the obvious solution is to extend that infrastructure so it
can serve your usecase.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-17 16:08             ` Thomas Gleixner
@ 2020-10-17 16:15               ` Alex Belits
  2020-10-17 20:03                 ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Alex Belits @ 2020-10-17 16:15 UTC (permalink / raw)
  To: tglx, nitesh, frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, will, linux-kernel, Prasun Kapoor, netdev,
	linux-arm-kernel


On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote:
> On Sat, Oct 17 2020 at 01:08, Alex Belits wrote:
> > On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote:
> > > On 10/4/20 7:14 PM, Frederic Weisbecker wrote:
> > I think that the goal of "finding source of disturbance" interface
> > is
> > different from what can be accomplished by tracing in two ways:
> > 
> > 1. "Source of disturbance" should provide some useful information
> > about
> > category of event and it cause as opposed to determining all
> > precise
> > details about things being called that resulted or could result in
> > disturbance. It should not depend on the user's knowledge about
> > details
> 
> Tracepoints already give you selectively useful information.

Carefully placed tracepoints also can give the user information about
failures of open(), write(), execve() or mmap(). However syscalls still
provide an error code instead of returning generic failure and letting
user debug the cause.

-- 
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
  2020-10-17 16:15               ` Alex Belits
@ 2020-10-17 20:03                 ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2020-10-17 20:03 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	catalin.marinas, will, linux-kernel, Prasun Kapoor, netdev,
	linux-arm-kernel

On Sat, Oct 17 2020 at 16:15, Alex Belits wrote:
> On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote:
>> On Sat, Oct 17 2020 at 01:08, Alex Belits wrote:
>> > I think that the goal of "finding source of disturbance" interface
>> > is
>> > different from what can be accomplished by tracing in two ways:
>> > 
>> > 1. "Source of disturbance" should provide some useful information
>> > about
>> > category of event and it cause as opposed to determining all
>> > precise
>> > details about things being called that resulted or could result in
>> > disturbance. It should not depend on the user's knowledge about
>> > details
>> 
>> Tracepoints already give you selectively useful information.
>
> Carefully placed tracepoints also can give the user information about
> failures of open(), write(), execve() or mmap(). However syscalls still
> provide an error code instead of returning generic failure and letting
> user debug the cause.

I have absolutely no idea what you are trying to tell me.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-10-01 14:47   ` Frederic Weisbecker
  2020-10-04 17:12     ` [EXT] " Alex Belits
@ 2021-01-22 14:13     ` Marcelo Tosatti
  2021-01-22 16:13       ` Paolo Abeni
  1 sibling, 1 reply; 49+ messages in thread
From: Marcelo Tosatti @ 2021-01-22 14:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alex Belits, rostedt, Prasun Kapoor, mingo, davem, linux-api,
	peterz, linux-arch, catalin.marinas, tglx, will,
	linux-arm-kernel, linux-kernel, netdev

On Thu, Oct 01, 2020 at 04:47:31PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 

> > so we don't need to flush it.
> 
> What guarantees that we have no backlog on it?

From Paolo's work to use lockless reading of 
per-CPU skb lists

https://www.spinics.net/lists/netdev/msg682693.html

It also exposed skb queue length to userspace

https://www.spinics.net/lists/netdev/msg684939.html

But if i remember correctly waiting for a RCU grace
period was also necessary to ensure no backlog !?! 

Paolo would you please remind us what was the sequence of steps?
(and then also, for the userspace isolation interface, where 
the application informs the kernel that its entering isolated
mode, is just confirming the queues have zero length is
sufficient?).

TIA!

> 
> > Currently flush_all_backlogs()
> > enqueues corresponding work on all CPUs including ones that run
> > isolated tasks. It leads to breaking task isolation for nothing.
> > 
> > In this patch, backlog flushing is enqueued only on non-isolated CPUs.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: use safe task_isolation_on_cpu() implementation]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  net/core/dev.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 90b59fc50dc9..83a282f7453d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -74,6 +74,7 @@
> >  #include <linux/cpu.h>
> >  #include <linux/types.h>
> >  #include <linux/kernel.h>
> > +#include <linux/isolation.h>
> >  #include <linux/hash.h>
> >  #include <linux/slab.h>
> >  #include <linux/sched.h>
> > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
> >  
> >  	get_online_cpus();
> >  
> > -	for_each_online_cpu(cpu)
> > +	smp_rmb();
> 
> What is it ordering?
> 
> > +	for_each_online_cpu(cpu) {
> > +		if (task_isolation_on_cpu(cpu))
> > +			continue;
> >  		queue_work_on(cpu, system_highpri_wq,
> >  			      per_cpu_ptr(&flush_works, cpu));
> > +	}
> >  
> >  	for_each_online_cpu(cpu)
> >  		flush_work(per_cpu_ptr(&flush_works, cpu));
> 
> Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2021-01-22 14:13     ` Marcelo Tosatti
@ 2021-01-22 16:13       ` Paolo Abeni
  0 siblings, 0 replies; 49+ messages in thread
From: Paolo Abeni @ 2021-01-22 16:13 UTC (permalink / raw)
  To: Marcelo Tosatti, Frederic Weisbecker
  Cc: Alex Belits, rostedt, Prasun Kapoor, mingo, davem, linux-api,
	peterz, linux-arch, catalin.marinas, tglx, will,
	linux-arm-kernel, linux-kernel, netdev

On Fri, 2021-01-22 at 11:13 -0300, Marcelo Tosatti wrote:
> On Thu, Oct 01, 2020 at 04:47:31PM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote:
> > > From: Yuri Norov <ynorov@marvell.com>
> > > 
> > > so we don't need to flush it.
> > 
> > What guarantees that we have no backlog on it?
> 
> From Paolo's work to use lockless reading of 
> per-CPU skb lists
> 
> https://www.spinics.net/lists/netdev/msg682693.html
> 
> It also exposed skb queue length to userspace
> 
> https://www.spinics.net/lists/netdev/msg684939.html
> 
> But if i remember correctly waiting for a RCU grace
> period was also necessary to ensure no backlog !?! 
> 
> Paolo would you please remind us what was the sequence of steps?
> (and then also, for the userspace isolation interface, where 
> the application informs the kernel that its entering isolated
> mode, is just confirming the queues have zero length is
> sufficient?).

After commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047, for CONFIG_RPS
enabled build, with no RFS in place to ensure backlog will be empty on
CPU X, the user must:
- configure the RPS map on each device before the device goes up to
explicitly exclude CPU X.

If CPU X is isolated after some network device already went up, to
ensure that the backlog will be empty on CPU X the user must:
- configure RPS on all the network device to exclude CPU X (as in the
previous scenario)
- wait a RCU grace period
- wait untill the backlog len on CPU X reported by procfs is 0

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2021-01-22 16:20 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-22 14:44 [PATCH v4 00/13] "Task_isolation" mode Alex Belits
2020-07-22 14:47 ` [PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-07-22 14:48 ` [PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-07-22 14:49 ` [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel Alex Belits
2020-10-01 13:56   ` Frederic Weisbecker
2020-10-04 14:44     ` [EXT] " Alex Belits
2020-10-04 23:14       ` Frederic Weisbecker
2020-10-05 18:52         ` Nitesh Narayan Lal
2020-10-06 10:35           ` Frederic Weisbecker
2020-10-17  1:13             ` Alex Belits
2020-10-17  1:08           ` Alex Belits
2020-10-17 16:08             ` Thomas Gleixner
2020-10-17 16:15               ` Alex Belits
2020-10-17 20:03                 ` Thomas Gleixner
2020-10-06 11:01         ` Alex Belits
2020-10-01 14:40   ` Frederic Weisbecker
2020-10-04 15:01     ` [EXT] " Alex Belits
2020-07-22 14:51 ` [PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-07-22 14:51 ` [PATCH v4 05/13] task_isolation: Add xen-specific hook Alex Belits
2020-07-22 14:53 ` [PATCH 06/13] task_isolation: Add driver-specific hooks Alex Belits
2020-07-22 14:54 ` [PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
2020-07-22 14:55 ` [PATCH 08/13] task_isolation: arch/arm64: " Alex Belits
2020-07-22 14:56 ` [PATCH v4 09/13] task_isolation: arch/arm: " Alex Belits
2020-07-22 14:57 ` [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-10-01 14:44   ` Frederic Weisbecker
2020-10-04 15:22     ` [EXT] " Alex Belits
2020-10-06 21:41       ` Frederic Weisbecker
2020-10-17  0:17         ` Alex Belits
2020-07-22 14:58 ` [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
2020-10-01 14:47   ` Frederic Weisbecker
2020-10-04 17:12     ` [EXT] " Alex Belits
2021-01-22 14:13     ` Marcelo Tosatti
2021-01-22 16:13       ` Paolo Abeni
2020-07-22 14:59 ` [PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-07-22 14:59 ` [PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-07-23 13:17 ` [PATCH v4 00/13] "Task_isolation" mode Thomas Gleixner
2020-07-23 14:26   ` Peter Zijlstra
2020-07-23 14:53     ` Thomas Gleixner
2020-07-23 14:29   ` Peter Zijlstra
2020-07-23 15:41     ` [EXT] " Alex Belits
2020-07-23 15:48       ` Peter Zijlstra
2020-07-23 16:19         ` Alex Belits
2020-07-23 15:18   ` Alex Belits
2020-07-23 15:49     ` Peter Zijlstra
2020-07-23 16:50       ` Alex Belits
2020-07-23 21:44         ` Thomas Gleixner
2020-07-24  3:00           ` [EXT] " Alex Belits
2020-07-24 16:08             ` Thomas Gleixner
2020-07-23 21:31     ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).