linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/9] "Task_isolation" mode
@ 2020-11-23 17:42 Alex Belits
  2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:42 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

This is an update of task isolation work that was originally done by
Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
November 2017. It is adapted to the current kernel and cleaned up to
implement its functionality in a more complete and cleaner manner.

Previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com/

The last version by Chris Metcalf (now obsolete but may be relevant
for comparison and understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetcalf@mellanox.com

Supported architectures

This version includes only architecture-independent code and arm64
support. x86 and arm support, and everything related to virtualization
will be re-added later when new kernel entry/exit implementation will
be accommodated. Support for other architectures can be added in a
somewhat modular manner, however it heavily depends on the details of
a kernel entry/exit support on any particular architecture.
Development of common entry/exit and conversion to it should simplify
that task. For now, this is the version that is currently being
developed on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as
possible, and remove everything related to determining, _why_
isolation was broken. Originally reporting isolation breaking was done
with a large number of of hooks in specific code (hardware interrupts,
syscalls, IPIs, page faults, etc.), and it was necessary to cover all
possible such events to have a reliable notification of a task about
its isolation being broken. To avoid such a fragile mechanism, this
version relies on mere fact of kernel being entered in isolation
mode. As a result, reporting happens later in kernel code, however it
covers everything.

This means that now there is no specific reporting, in kernel log or
elsewhere, about the reasons for breaking isolation. Information about
that may be valuable at runtime, so a separate mechanism for generic
reporting "why did CPU enter kernel" (with isolation or under other
conditions) may be a good thing. That can be done later, however at
this point it's important that task isolation does not require it, and
such mechanism will not be developed with the limited purpose of
supporting isolation alone.

General description

This is the result of development and maintenance of task isolation
functionality that originally started based on task isolation patch
v15 and was later updated to include v16. It provided predictable
environment for userspace tasks running on arm64 processors alongside
with full-featured Linux environment. It is intended to provide
reliable interruption-free environment from the point when a userspace
task enters isolation and until the moment it leaves isolation or
receives a signal intentionally sent to it, and was successfully used
for this purpose. While CPU isolation with nohz provides an
environment that is close to this requirement, the remaining IPIs and
other disturbances keep it from being usable for tasks that require
complete predictability of CPU timing.

This set of patches only covers the implementation of task isolation,
however additional functionality, such as selective TLB flushes, may
be implemented to avoid other kinds of disturbances that affect
latency and performance of isolated tasks.

The userspace support and test program is now at
https://github.com/abelits/libtmc . It was originally developed for
earlier implementation, so it has some checks that may be redundant
now but kept for compatibility.

My thanks to Chris Metcalf for design and maintenance of the original
task isolation patch, Francis Giraldeau <francis.giraldeau@gmail.com>
and Yuri Norov <ynorov@marvell.com> for various contributions to this
work, Frederic Weisbecker <frederic@kernel.org> for his work on CPU
isolation and housekeeping that made possible to remove some less
elegant solutions that I had to devise for earlier, <4.17 kernels, and
Nitesh Narayan Lal <nitesh@redhat.com> for adapting earlier patches
related to interrupt and work distribution in presence of CPU
isolation.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
@ 2020-11-23 17:56 ` Alex Belits
  2020-11-23 21:48   ` Thomas Gleixner
  2020-11-23 17:56 ` [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function Alex Belits
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 322dcbfcc933..300ce6648923 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -284,6 +284,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -391,6 +392,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 698bc0bc18d1..43999caf47a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1936,6 +1936,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
  2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-11-23 17:56 ` Alex Belits
  2020-11-23 21:49   ` Thomas Gleixner
  2020-11-23 17:56 ` [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel Alex Belits
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

This function checks to see if a vmstat worker is not running,
and the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 300ce6648923..24392a957cfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -285,6 +285,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -393,6 +394,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 43999caf47a4..5b0ad7ed65f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1945,6 +1945,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
  2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
  2020-11-23 17:56 ` [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-11-23 17:56 ` Alex Belits
  2020-11-23 22:01   ` Thomas Gleixner
  2020-11-23 17:57 ` [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) to do
so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

The last stage of enabling task isolation happens in
task_isolation_exit_to_user_mode() that runs last before returning
to userspace and changes ll_isol_flags (see later) to prevent other
CPUs from interfering with isolated task.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

Isolation also disables CPU state synchronization mechanisms that
are. normally done by IPI. In the future, more synchronization
mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from isolated task --
remote synchronization requests must be re-enabled and
synchronization procedure triggered, before anything other than
low-level kernel entry code is called. Same applies to exiting from
kernel to userspace after isolation is enabled.

For this purpose, per-CPU low-level flags ll_isol_flags are used to
indicate isolation state, and task_isolation_kernel_enter() is used
to safely clear them early in kernel entry. CPU mask corresponding
to isolation bit in ll_isol_flags is visible to userspace as
/sys/devices/system/cpu/isolation_running, and can be used for
monitoring.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 drivers/base/cpu.c                            |  23 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 326 ++++++++
 include/linux/sched.h                         |   5 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  27 +
 kernel/Makefile                               |   2 +
 kernel/isolation.c                            | 714 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1169 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44fde25bb221..4b28f2022c98 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5250,6 +5250,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 8f1d6569564c..afa6163203c7 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -284,6 +284,26 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t isolation_running_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	int n;
+	cpumask_var_t isolation_running;
+
+	if (!zalloc_cpumask_var(&isolation_running, GFP_KERNEL))
+		return -ENOMEM;
+
+	task_isolation_cpumask(isolation_running);
+	n = sprintf(buf, "%*pbl\n", cpumask_pr_args(isolation_running));
+
+	free_cpumask_var(isolation_running);
+
+	return n;
+}
+static DEVICE_ATTR_RO(isolation_running);
+#endif
+
 #ifdef CONFIG_NO_HZ_FULL
 static ssize_t print_cpus_nohz_full(struct device *dev,
 				    struct device_attribute *attr, char *buf)
@@ -471,6 +491,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_isolation_running.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 107cedd7019a..9a6c9ce95080 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -529,6 +529,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..eb6c99bd2db4
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,326 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/*
+ * Logging
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+
+#define BIT_LL_TASK_ISOLATION		(0)
+#define BIT_LL_TASK_ISOLATION_BROKEN	(1)
+#define BIT_LL_TASK_ISOLATION_REQUEST	(2)
+#define FLAG_LL_TASK_ISOLATION		(1 << BIT_LL_TASK_ISOLATION)
+#define FLAG_LL_TASK_ISOLATION_BROKEN	(1 << BIT_LL_TASK_ISOLATION_BROKEN)
+#define FLAG_LL_TASK_ISOLATION_REQUEST	(1 << BIT_LL_TASK_ISOLATION_REQUEST)
+
+DECLARE_PER_CPU(unsigned long, ll_isol_flags);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_kernel_enter() - clear low-level task isolation flag
+ *
+ * This should be called immediately after entering kernel. It must
+ * be inline, and suitable for running after leaving isolated
+ * userspace in a "stale" state when synchronization is required
+ * before the CPU can safely enter the rest of the kernel.
+ */
+static __always_inline void task_isolation_kernel_enter(void)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated.  When
+	 * CPU is running isolated task until this point anything that
+	 * causes an interrupt on this CPU must end up calling this
+	 * before touching the rest of kernel. That is, this function
+	 * or fast_task_isolation_cpu_cleanup() or stop_isolation()
+	 * calling it. If any interrupt, including scheduling timer,
+	 * arrives, it will still end up here early after entering
+	 * kernel.  From this point interrupts are disabled until all
+	 * CPUs will see that this CPU is no longer running isolated
+	 * task.
+	 *
+	 * See also fast_task_isolation_cpu_cleanup().
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0)
+		return;
+
+	raw_local_irq_save(flags);
+
+	/* Change low-level flags to indicate broken isolation */
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_BROKEN);
+
+	/*
+	 * If something happened that requires a barrier that would
+	 * otherwise be called from remote CPUs by CPU kick procedure,
+	 * this barrier runs instead of it. After this barrier, CPU
+	 * kick procedure would see the updated ll_isol_flags, so it
+	 * will run its own IPI to trigger a barrier.
+	 */
+	smp_mb();
+	/*
+	 * Synchronize instructions -- this CPU was not kicked while
+	 * in isolated mode, so it might require synchronization.
+	 * There might be an IPI if kick procedure happened and
+	 * ll_isol_flags was already updated while it assembled a CPU
+	 * mask. However if this did not happen, synchronize everything
+	 * here.
+	 */
+	instr_sync();
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * task_isolation_exit_to_user_mode() - set low-level task isolation flag
+ * if task isolation is requested
+ *
+ * This should be called immediately before exiting kernel. It must
+ * be inline, and the state of CPI may become "stale" between setting
+ * the flag and returning to the userspace.
+ */
+static __always_inline void task_isolation_exit_to_user_mode(void)
+{
+	unsigned long flags;
+
+	/* Check if this task is entering isolation */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_REQUEST)
+	    == 0)
+		return;
+	raw_local_irq_save(flags);
+
+	/* Set low-level flags */
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION);
+	/*
+	 * After this barrier the rest of the system stops using IPIs
+	 * to synchronize this CPU state. Since only exit to userspace
+	 * follows, this is safe. Synchronization will happen again in
+	 * task_isolation_enter() when this CPU will enter kernel.
+	 */
+	smp_mb();
+	/*
+	 * From this point this is recognized as isolated by
+	 * other CPUs
+	 */
+	raw_local_irq_restore(flags);
+}
+
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+static inline int task_isolation_on_cpu(int cpu)
+{
+	return test_bit(BIT_LL_TASK_ISOLATION, &per_cpu(ll_isol_flags, cpu));
+}
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * task_isolation_before_pending_work_check() - check for isolation breaking
+ *
+ * This routine is called from the code responsible for exiting to user mode,
+ * before the point when thread flags are checked for pending work.
+ * That function must be called if the current task is isolated, because
+ * TIF_TASK_ISOLATION must trigger a call to it.
+ */
+void task_isolation_before_pending_work_check(void);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_syscall()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened then just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remotes cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation;
+ * it should be called if TIF_TASK_ISOLATION is set when notifying a
+ * task of a pending signal. The task_isolation_interrupt() function
+ * normally generates a diagnostic for events that just interrupt a
+ * task without generating a signal; here we need to hook the paths
+ * that correspond to interrupts that do generate a signal.  The macro
+ * task_isolation_signal() wraps the TIF_TASK_ISOLATION flag test to
+ * simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_kernel_enter(void) {}
+static inline void task_isolation_exit_to_user_mode(void) {}
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_before_pending_work_check(void) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..5d8b17aa544b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1316,6 +1316,11 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags; /* prctl */
+	unsigned short			task_isolation_state;
+#endif
+
 #ifdef CONFIG_X86_MCE
 	void __user			*mce_vaddr;
 	__u64				mce_kflags;
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..bf72afb0364f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -247,4 +247,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits)	(((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..f708580fb56d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -648,6 +648,33 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	  Allow userspace processes that place themselves on cores with
+	  nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	  to "isolate" themselves from the kernel.  Prior to returning to
+	  userspace, isolated tasks will arrange that no future kernel
+	  activity will interrupt the task while the task is running in
+	  userspace.  Attempting to re-enter the kernel while in this mode
+	  will cause the task to be terminated with a signal; you must
+	  explicitly use prctl() to disable task isolation before resuming
+	  normal use of the kernel.
+
+	  This "hard" isolation from the kernel is required for userspace
+	  tasks that are running hard real-time tasks in userspace, such as
+	  a high-speed network driver in userspace.  Without this option, but
+	  with NO_HZ_FULL enabled, the kernel will make a best-faith, "soft"
+	  effort to shield a single userspace process from interrupts, but
+	  makes no guarantees.
+
+	  You should say "N" unless you are intending to run a
+	  high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index af601b9bda0e..8a8a361dc1a5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -133,6 +133,8 @@ KCOV_INSTRUMENT_stackleak.o := n
 
 obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..127b2a1cb0cb
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,714 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * Low-level isolation flags.
+ * Those flags are used by low-level isolation set/clear/check routines.
+ * Those flags should be set last before return to userspace and cleared
+ * first upon kernel entry, and synchronized to allow isolation breaking
+ * detection before touching potentially unsynchronized parts of kernel.
+ * Isolated task does not receive synchronization events of any kind, so
+ * at the time of the first entry into kernel it might not be ready to
+ * run most of the kernel code. However to perform synchronization
+ * properly, kernel entry code should also enable synchronization events
+ * at the same time. This presents a problem because more kernel code
+ * should run to determine the cause of isolation breaking, signals may
+ * have to be generated, etc. So some flag clearing and synchronization
+ * should happen in "low-level" entry code but processing of isolation
+ * breaking should happen in "high-level" code. Low-level isolation flags
+ * should be set in that low-level code, possibly long before the cause
+ * of isolation breaking is known. Symmetrically, entering isolation
+ * should disable synchronization events before returning to userspace
+ * but after all potentially volatile code is finished.
+ */
+DEFINE_PER_CPU(unsigned long, ll_isol_flags);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means, counter should be
+ * decremented back, and isolation exit that is already in progress, should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	/* Barrier to synchronize with recording isolated task information */
+	smp_rmb();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in the message
+			 * originated on remote CPU, and it did not
+			 * re-enter isolated state since then (warned
+			 * is true). Only local messages following
+			 * remote messages, likely about the same
+			 * isolation breaking event, are skipped to
+			 * avoid duplication. If remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, local cause is skipped
+			 * from messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure, index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 * It should be called either directly when isolation breaking is
+	 * being processed, or using IPI from another CPU when it intends
+	 * to break isolation on the given CPU.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated. Any
+	 * entry into kernel will call task_isolation_kernel_enter()
+	 * before calling this, so this will be already done by
+	 * setting per-cpu flags and synchronizing in that function.
+	 *
+	 * For development purposes it makes sense to check if it was
+	 * done, because there is a possibility that some entry points
+	 * were left unguarded. That would be clearly a bug because it
+	 * will mean that regular kernel code is running with no
+	 * synchronization.
+	 */
+	local_irq_save(flags);
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	/* Barrier to sync with requesting a task isolation breaking */
+	smp_mb__after_atomic();
+	/*
+	 * At this point breaking isolation will be treated as a
+	 * separate isolation-breaking event, however interrupts won't
+	 * arrive until local_irq_restore()
+	 */
+
+	/*
+	 * Check for the above mentioned entry without a call to
+	 * task_isolation_kernel_enter()
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) {
+		/*
+		 * If it did happen, call the function here, to
+		 * prevent further problems from running in
+		 * un-synchronized state
+		 */
+		task_isolation_kernel_enter();
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+		"Isolation breaking was not detected on kernel entry\n");
+	}
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	local_irq_restore(flags);
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		put_cpu();
+	} else {
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where
+		 * isolated task is running
+		 * isol_exit_counter will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc. More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems. This, however, may still not
+ * have the intended result, so the rest of the system takes into account
+ * the possibility of receiving an interrupt and isolation breaking later.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ *
+ * This, however, does not enable isolation yet, as far as low-level
+ * flags are concerned. So if interrupts will be enabled, it's still
+ * possible for the task to be interrupted. The call to
+ * task_isolation_exit_to_user_mode() should finally enable task
+ * isolation after this function set FLAG_LL_TASK_ISOLATION_REQUEST.
+ */
+void task_isolation_start(void)
+{
+	int error;
+	unsigned long flags;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation
+	 * disabled), on our way out of the kernel from the prctl()
+	 * that turned it on.  If we are exiting from the kernel in
+	 * another state, it means we made it back into the kernel
+	 * without disabling task isolation, and we should investigate
+	 * how (and in any case disable task isolation at this
+	 * point). We are clearly not on the path back from the
+	 * prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		stop_isolation(current);
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+		"Isolation start requested while not in the normal state\n");
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	local_irq_save(flags);
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+
+	current->task_isolation_state = STATE_ISOLATED;
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_REQUEST);
+	/* Barrier to synchronize with reading of flags */
+	smp_mb();
+	local_irq_restore(flags);
+	return;
+
+error:
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	if ((flags & PR_TASK_ISOLATION_ENABLE) == 0)
+		return;
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * Check if by any chance syscall is being processed from
+	 * isolated state without a call to
+	 * task_isolation_kernel_enter() happening on entry
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) {
+		/*
+		 * If it did happen, call the function here, to
+		 * prevent further problems from running in
+		 * un-synchronized state
+		 */
+		task_isolation_kernel_enter();
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+			"Isolation breaking was not detected on syscall\n");
+		}
+	/*
+	 * Clear low-level isolation flags to avoid triggering
+	 * a signal on return to userspace
+	 */
+	this_cpu_write(ll_isol_flags, 0);
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from the code responsible for exiting to user mode,
+ * before the point when thread flags are checked for pending work.
+ * That function must be called if the current task is isolated, because
+ * TIF_TASK_ISOLATION must trigger a call to it.
+ */
+void task_isolation_before_pending_work_check(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	/* Handle isolation breaking */
+	if ((current->task_isolation_state != STATE_NORMAL)
+	    && ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_BROKEN)
+	        != 0)) {
+		/*
+		 * Clear low-level isolation flags to avoid triggering
+		 * a signal again
+		 */
+		this_cpu_write(ll_isol_flags, 0);
+		/* Send signal to notify about isolation breaking */
+		send_isolation_signal(current);
+		/* Produce generic message about lost isolation */
+		pr_task_isol_warn(smp_processor_id(), "task_isolation lost\n");
+		debug_dump_stack();
+	}
+
+	/*
+	 * If this CPU is in the map of CPUs with cleanup pending,
+	 * remove it from the map and call cleanup
+	 */
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Set CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	/* Barrier to synchronize with writing task isolation flags */
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	/* Barrier to synchronize with writing task isolation flags */
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index ef8f2a28d37c..252886e6aa76 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -759,6 +760,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..0da007ce8666 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 387b4bef7dd1..27fedddc2729 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -720,6 +721,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -867,8 +881,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 81632cd5e3b7..a213952541db 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -896,6 +896,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (2 preceding siblings ...)
  2020-11-23 17:56 ` [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-11-23 17:57 ` Alex Belits
  2020-11-23 22:31   ` Thomas Gleixner
  2020-11-23 17:57 ` [PATCH v5 5/9] task_isolation: Add driver-specific hooks Alex Belits
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

Kernel entry and exit functions for task isolation are added to context
tracking and common entry points. Common handling of pending work on exit
to userspace now processes isolation breaking, cleanup and start.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.10]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/hardirq.h   |  2 ++
 include/linux/sched.h     |  2 ++
 kernel/context_tracking.c |  5 +++++
 kernel/entry/common.c     | 10 +++++++++-
 kernel/irq/irqdesc.c      |  5 +++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 754f67ac4326..b9e604ae6a0d 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <asm/hardirq.h>
 
 extern void synchronize_irq(unsigned int irq);
@@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void);
 	do {							\
 		lockdep_off();					\
 		arch_nmi_enter();				\
+		task_isolation_kernel_enter();			\
 		printk_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d8b17aa544b..51c2d774250b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/isolation.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 #ifdef CONFIG_SMP
 static __always_inline void scheduler_ipi(void)
 {
+	task_isolation_kernel_enter();
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36a98c48aedc..379a48fd0e65 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state)
 		__this_cpu_write(context_tracking.state, state);
 	}
 	context_tracking_recursion_exit();
+
+	task_isolation_exit_to_user_mode();
 }
 EXPORT_SYMBOL_GPL(__context_tracking_enter);
 
@@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
 	if (!context_tracking_recursion_enter())
 		return;
 
+	task_isolation_kernel_enter();
+
 	if (__this_cpu_read(context_tracking.state) == state) {
 		if (__this_cpu_read(context_tracking.active)) {
 			/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e9e2df3f3f9e..10a520894105 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -4,6 +4,7 @@
 #include <linux/entry-common.h>
 #include <linux/livepatch.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -183,13 +184,20 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
 
+	task_isolation_before_pending_work_check();
+
+	ti_work = READ_ONCE(current_thread_info()->flags);
+
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
+	if (unlikely(ti_work & _TIF_TASK_ISOLATION))
+		task_isolation_start();
+
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
 	/* Ensure that the address limit is intact and no locks are held */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b8f0a7574f55 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq = hwirq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	irq_enter();
 
 #ifdef CONFIG_IRQ_DOMAIN
@@ -710,6 +713,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	/*
 	 * NMI context needs to be setup earlier in order to deal with tracing.
 	 */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 5/9] task_isolation: Add driver-specific hooks
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (3 preceding siblings ...)
  2020-11-23 17:57 ` [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-11-23 17:57 ` Alex Belits
  2020-12-02 14:18   ` Mark Rutland
  2020-11-23 17:58 ` [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality Alex Belits
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

Some drivers don't call functions that call
task_isolation_kernel_enter() in interrupt handlers. Call it
directly.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/irqchip/irq-armada-370-xp.c | 6 ++++++
 drivers/irqchip/irq-gic-v3.c        | 3 +++
 drivers/irqchip/irq-gic.c           | 3 +++
 drivers/s390/cio/cio.c              | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index d7eb2e93db8f..4ac7babe1abe 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -29,6 +29,7 @@
 #include <linux/slab.h>
 #include <linux/syscore_ops.h>
 #include <linux/msi.h>
+#include <linux/isolation.h>
 #include <asm/mach/arch.h>
 #include <asm/exception.h>
 #include <asm/smp_plat.h>
@@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
 static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 {
 	u32 msimask, msinr;
+	int isol_entered = 0;
 
 	msimask = readl_relaxed(per_cpu_int_base +
 				ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
@@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 			continue;
 
 		if (is_chained) {
+			if (!isol_entered) {
+				task_isolation_kernel_enter();
+				isol_entered = 1;
+			}
 			irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
 					       msinr - PCI_MSI_DOORBELL_START);
 			generic_handle_irq(irq);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0febe8..ded26dd4da0f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 
 #include <linux/irqchip.h>
 #include <linux/irqchip/arm-gic-common.h>
@@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 {
 	u32 irqnr;
 
+	task_isolation_kernel_enter();
+
 	irqnr = gic_read_iar();
 
 	if (gic_supports_nmi() &&
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 6053245a4754..bb482b4ae218 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -35,6 +35,7 @@
 #include <linux/interrupt.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 #include <linux/irqchip.h>
 #include <linux/irqchip/chained_irq.h>
 #include <linux/irqchip/arm-gic.h>
@@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 	struct gic_chip_data *gic = &gic_data[0];
 	void __iomem *cpu_base = gic_data_cpu_base(gic);
 
+	task_isolation_kernel_enter();
+
 	do {
 		irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
 		irqnr = irqstat & GICC_IAR_INT_ID_MASK;
diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
index 6d716db2a46a..beab88881b6d 100644
--- a/drivers/s390/cio/cio.c
+++ b/drivers/s390/cio/cio.c
@@ -20,6 +20,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 #include <asm/cio.h>
 #include <asm/delay.h>
 #include <asm/irq.h>
@@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
 	struct irb *irb;
 	int irq_context;
 
+	task_isolation_kernel_enter();
+
 	irb = this_cpu_ptr(&cio_irb);
 	/* Store interrupt response block to lowcore. */
 	if (tsch(sch->schid, irb) != 0)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (4 preceding siblings ...)
  2020-11-23 17:57 ` [PATCH v5 5/9] task_isolation: Add driver-specific hooks Alex Belits
@ 2020-11-23 17:58 ` Alex Belits
  2020-12-02 13:59   ` Mark Rutland
  2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

In do_notify_resume(), call task_isolation_before_pending_work_check()
first, to report isolation breaking, then after handling all pending
work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.

Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
don't clear _TIF_TASK_ISOLATION in the loop.

Early kernel entry code calls task_isolation_kernel_enter(). In
particular:

Vectors:
el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
el1_irq -> asm_nmi_enter(), handle_arch_irq()
el1_error -> do_serror()
el0_sync -> el0_sync_handler()
el0_irq -> handle_arch_irq()
el0_error -> do_serror()
el0_sync_compat -> el0_sync_compat_handler()
el0_irq_compat -> handle_arch_irq()
el0_error_compat -> do_serror()

SDEI entry:
__sdei_asm_handler -> __sdei_handler() -> nmi_enter()

Functions called from there:
asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()

Handlers:
do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
  or task_isolation_kernel_enter()
el1_sync_handler() -> task_isolation_kernel_enter()
el0_sync_handler() -> task_isolation_kernel_enter()
el0_sync_compat_handler() -> task_isolation_kernel_enter()

handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
There is a separate patch for irqchips that do not follow this rule.

handle_domain_irq() -> task_isolation_kernel_enter()
do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
nmi_enter() -> task_isolation_kernel_enter()

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: simplified to match kernel 5.10]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/barrier.h     |  1 +
 arch/arm64/include/asm/thread_info.h |  7 +++++--
 arch/arm64/kernel/entry-common.c     |  7 +++++++
 arch/arm64/kernel/ptrace.c           | 10 ++++++++++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  3 +++
 7 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1515f6f153a0..fc958d8d8945 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -141,6 +141,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index c3009b0e5239..ad5a6dd380cf 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -49,6 +49,7 @@
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
+#define instr_sync()	isb()
 /*
  * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz
  * and 0 otherwise.
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1fbab854a51b..3321c69c46fe 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -68,6 +68,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
 #define TIF_MTE_ASYNC_FAULT	6	/* MTE Asynchronous Tag Check Fault */
+#define TIF_TASK_ISOLATION	7	/* task isolation enabled for task */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
@@ -87,6 +88,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
@@ -101,11 +103,12 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_MTE_ASYNC_FAULT | _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
-				 _TIF_SYSCALL_EMU)
+				 _TIF_SYSCALL_EMU | _TIF_TASK_ISOLATION)
 
 #ifdef CONFIG_SHADOW_CALL_STACK
 #define INIT_SCS							\
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 43d4c329775f..8152760de683 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -8,6 +8,7 @@
 #include <linux/context_tracking.h>
 #include <linux/ptrace.h>
 #include <linux/thread_info.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/daifflags.h>
@@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_DABT_CUR:
 	case ESR_ELx_EC_IABT_CUR:
@@ -249,6 +252,8 @@ asmlinkage void notrace el0_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC64:
 		el0_svc(regs);
@@ -321,6 +326,8 @@ asmlinkage void notrace el0_sync_compat_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC32:
 		el0_svc_compat(regs);
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index f49b349e16a3..2941f2b16796 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1803,6 +1804,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return NO_SYSCALL;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return NO_SYSCALL;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return NO_SYSCALL;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8184cad8890..e3a82b75e39d 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -911,6 +912,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
+	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK | \
+	 _TIF_MTE_ASYNC_FAULT)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -921,6 +927,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_before_pending_work_check();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -956,7 +964,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 18e9727d3f64..4401eac4710c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -33,6 +33,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -890,6 +891,8 @@ static void do_handle_IPI(int ipinr)
 {
 	unsigned int cpu = smp_processor_id();
 
+	task_isolation_kernel_enter();
+
 	if ((unsigned)ipinr < NR_IPI)
 		trace_ipi_entry_rcuidle(ipi_types[ipinr]);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (5 preceding siblings ...)
  2020-11-23 17:58 ` [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality Alex Belits
@ 2020-11-23 17:58 ` Alex Belits
  2020-11-23 22:13   ` Frederic Weisbecker
                     ` (2 more replies)
  2020-11-23 17:58 ` [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
                   ` (3 subsequent siblings)
  10 siblings, 3 replies; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds check for it.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a213952541db..6c8679e200f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	smp_rmb();
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (6 preceding siblings ...)
  2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-11-23 17:58 ` Alex Belits
  2020-11-23 17:58 ` [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race condition between this update and isolation breaking is avoided
at the cost of disabling per_cpu buffer writing for the time of update
when it coincides with isolation breaking.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 63 ++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dc83b3fa9fe7..9e4fb3ed2af0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1939,6 +1940,38 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	smp_rmb();
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure, update is done, and isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -2028,13 +2061,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						&cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+					&cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -2083,13 +2125,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
+			if (!update_if_isolated(cpu_buffer, cpu_id))
+				schedule_work_on(cpu_id,
 					 &cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+				wait_for_completion(&cpu_buffer->update_done);
+			}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (7 preceding siblings ...)
  2020-11-23 17:58 ` [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-11-23 17:58 ` Alex Belits
  2020-11-23 22:29   ` Frederic Weisbecker
  2020-11-24 16:36 ` [PATCH v5 0/9] "Task_isolation" mode Tom Rix
  2020-12-05 20:40 ` Pavel Machek
  10 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 4d17501433be..b2faecf58ed0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -932,9 +932,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-11-23 21:48   ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2020-11-23 21:48 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

Alex,

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:

why are you insisting on adding 'task_isolation: ' as prefix to every
single patch? That's wrong as I explained before.

The prefix denotes the affected subsystem and 'task_isolation' is _NOT_
a subsystem. It's the project name you are using but the affected code
belongs to the memory management subsystem and if you run

 git log mm/vmstat.c

you might get a hint what the proper prefix is, i.e. 'mm/vmstat: '

> In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
> the quiet_vmstat() function became asynchronous, in the sense that
> the vmstat work was still scheduled to run on the core when the
> function returned.  For task isolation, we need a synchronous

This changelog is useless because how should someone not familiar with
the term 'task isolation' figure out what that means?

It's not the reviewers job to figure that out. Again: Go read and adhere
to Documentation/process/*

Aside of that your patches are CR/LF inflicted. Please fix your work
flow and tools.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
  2020-11-23 17:56 ` [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-11-23 21:49   ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2020-11-23 21:49 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:
> This function checks to see if a vmstat worker is not running,
> and the vmstat diffs don't require an update.  The function is
> called from the task-isolation code to see if we need to
> actually do some work to quiet vmstat.

A changelog has to explain _WHY_ this change is necessary and not _WHAT_
the patch is doing.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
  2020-11-23 17:56 ` [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-11-23 22:01   ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2020-11-23 22:01 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

Alex,

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  drivers/base/cpu.c                            |  23 +
>  include/linux/hrtimer.h                       |   4 +
>  include/linux/isolation.h                     | 326 ++++++++
>  include/linux/sched.h                         |   5 +
>  include/linux/tick.h                          |   3 +
>  include/uapi/linux/prctl.h                    |   6 +
>  init/Kconfig                                  |  27 +
>  kernel/Makefile                               |   2 +
>  kernel/isolation.c                            | 714 ++++++++++++++++++
>  kernel/signal.c                               |   2 +
>  kernel/sys.c                                  |   6 +
>  kernel/time/hrtimer.c                         |  27 +
>  kernel/time/tick-sched.c                      |  18 +

I asked you before to split this up into bits and pieces and argue and
justify each change. Throwing this wholesale over the fence is going
nowhere. It's not revieable at all.

Aside of that ignoring review comments is a sure path to make yourself
ignored:

> +/*
> + * Logging
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
> +
> +#define pr_task_isol_emerg(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)

The comments various people made about that are not going away and none
of this is going near anything I'm responsible for unless you provide
these independent of the rest and with a reasonable justification why
you can't use any other existing mechanism or extend it for your use
case.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-11-23 22:13   ` Frederic Weisbecker
  2020-11-23 22:35     ` Alex Belits
  2020-11-23 22:36   ` Thomas Gleixner
  2020-12-02 14:20   ` Mark Rutland
  2 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2020-11-23 22:13 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

Hi Alex,

On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds check for it.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a213952541db..6c8679e200f0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/stat.h>
>  #include <linux/sched/nohz.h>
> +#include <linux/isolation.h>
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();
> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

Like I said in subsequent reviews, we are not going to ignore IPIs.
We must fix the sources of these IPIs instead.

Thanks.

>  
>  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 17:58 ` [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-11-23 22:29   ` Frederic Weisbecker
  2020-11-23 22:39     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2020-11-23 22:29 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> Make sure that kick_all_cpus_sync() does not call CPUs that are running
> isolated tasks.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: use safe task_isolation_cpumask() implementation]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/smp.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 4d17501433be..b2faecf58ed0 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
>   */
>  void kick_all_cpus_sync(void)
>  {
> +	struct cpumask mask;
> +
>  	/* Make sure the change is visible before we kick the cpus */
>  	smp_mb();
> -	smp_call_function(do_nothing, NULL, 1);
> +
> +	preempt_disable();
> +#ifdef CONFIG_TASK_ISOLATION
> +	cpumask_clear(&mask);
> +	task_isolation_cpumask(&mask);
> +	cpumask_complement(&mask, &mask);
> +#else
> +	cpumask_setall(&mask);
> +#endif
> +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> +	preempt_enable();

Same comment about IPIs here.

>  }
>  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
  2020-11-23 17:57 ` [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-11-23 22:31   ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2020-11-23 22:31 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

Alex,

On Mon, Nov 23 2020 at 17:57, Alex Belits wrote:
> Kernel entry and exit functions for task isolation are added to context
> tracking and common entry points. Common handling of pending work on exit
> to userspace now processes isolation breaking, cleanup and start.

Again: You fail to explain the rationale and just explain what the patch
is doing. I can see what the patch is doing from the patch itself.

> ---
>  include/linux/hardirq.h   |  2 ++
>  include/linux/sched.h     |  2 ++
>  kernel/context_tracking.c |  5 +++++
>  kernel/entry/common.c     | 10 +++++++++-
>  kernel/irq/irqdesc.c      |  5 +++++

At least 3 different subsystems, which means this again failed to be
split into seperate patches.

>  extern void synchronize_irq(unsigned int irq);
> @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void);
>  	do {							\
>  		lockdep_off();					\
>  		arch_nmi_enter();				\
> +		task_isolation_kernel_enter();			\

Where is the explanation why this is safe and correct vs. this fragile
code path?

> @@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
>  #ifdef CONFIG_SMP
>  static __always_inline void scheduler_ipi(void)
>  {
> +	task_isolation_kernel_enter();

Why is the scheduler_ipi() special? Just because everything else cannot
happen at all? Oh well...

>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
> @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state)
>  		__this_cpu_write(context_tracking.state, state);
>  	}
>  	context_tracking_recursion_exit();
> +
> +	task_isolation_exit_to_user_mode();

Why is this here at all and why is it outside of the recursion
protection

>  }
>  EXPORT_SYMBOL_GPL(__context_tracking_enter);
>  
> @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
>  	if (!context_tracking_recursion_enter())
>  		return;
>  
> +	task_isolation_kernel_enter();

while this is inside?

And why has the scheduler_ipi() on x86 call this twice? Just because?

>  	if (__this_cpu_read(context_tracking.state) == state) {
>  		if (__this_cpu_read(context_tracking.active)) {
>  			/*
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>  static void exit_to_user_mode_prepare(struct pt_regs *regs)
>  {
> -	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +	unsigned long ti_work;
>  
>  	lockdep_assert_irqs_disabled();
>  
> +	task_isolation_before_pending_work_check();
> +
> +	ti_work = READ_ONCE(current_thread_info()->flags);
> +
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>  
> +	if (unlikely(ti_work & _TIF_TASK_ISOLATION))
> +		task_isolation_start();

Where is the explaination of this change?

Aside of that how does anything of this compile on x86 at all?

Answer: It does not ...

Stop this frenzy right now. It's going nowhere and all you achieve is to
make people more grumpy than they are already.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-11-23 22:13   ` Frederic Weisbecker
@ 2020-11-23 22:35     ` Alex Belits
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Belits @ 2020-11-23 22:35 UTC (permalink / raw)
  To: frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev


On Mon, 2020-11-23 at 23:13 +0100, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> Hi Alex,
> 
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index a213952541db..6c8679e200f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/stat.h>
> >  #include <linux/sched/nohz.h>
> > +#include <linux/isolation.h>
> >  #include <linux/module.h>
> >  #include <linux/irq_work.h>
> >  #include <linux/posix-timers.h>
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
> > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> >  		return;
> 
> Like I said in subsequent reviews, we are not going to ignore IPIs.
> We must fix the sources of these IPIs instead.

This is what I am working on right now. This is made with an assumption
that CPU running isolated task has no reason to be kicked because
nothing else is supposed to be there. Usually this is true and when not
true is still safe when everything else is behaving right. For this
version I have kept the original implementation with minimal changes to
make it possible to use task isolation at all.

I agree that it's a much better idea is to determine if the CPU should
be kicked. If it really should, that will be a legitimate cause to
break isolation there, because CPU running isolated task has no
legitimate reason to have timers running. Right now I am trying to
determine the origin of timers that _still_ show up as running in the
current kernel version, so I think, this is a rather large chunk of
work that I have to do separately.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
  2020-11-23 22:13   ` Frederic Weisbecker
@ 2020-11-23 22:36   ` Thomas Gleixner
  2020-12-02 14:20   ` Mark Rutland
  2 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2020-11-23 22:36 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23 2020 at 17:58, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
>
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
>
> This patch adds check for it.

git grep 'This patch' Documentation/process/

>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

Undocumented smp_rmb() ...

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

I still have to see a convincing argument why task isolation is special
and not just a straight forward extension of NOHZ full cpu isolation.

It's not special as much as you want it to be special.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 22:29   ` Frederic Weisbecker
@ 2020-11-23 22:39     ` Alex Belits
  2020-11-23 23:21       ` Frederic Weisbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-23 22:39 UTC (permalink / raw)
  To: frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev


On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > running
> > isolated tasks.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: use safe task_isolation_cpumask()
> > implementation]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/smp.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 4d17501433be..b2faecf58ed0 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> >   */
> >  void kick_all_cpus_sync(void)
> >  {
> > +	struct cpumask mask;
> > +
> >  	/* Make sure the change is visible before we kick the cpus */
> >  	smp_mb();
> > -	smp_call_function(do_nothing, NULL, 1);
> > +
> > +	preempt_disable();
> > +#ifdef CONFIG_TASK_ISOLATION
> > +	cpumask_clear(&mask);
> > +	task_isolation_cpumask(&mask);
> > +	cpumask_complement(&mask, &mask);
> > +#else
> > +	cpumask_setall(&mask);
> > +#endif
> > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > +	preempt_enable();
> 
> Same comment about IPIs here.

This is different from timers. The original design was based on the
idea that every CPU should be able to enter kernel at any time and run
kernel code with no additional preparation. Then the only solution is
to always do full broadcast and require all CPUs to process it.

What I am trying to introduce is the idea of CPU that is not likely to
run kernel code any soon, and can afford to go through an additional
synchronization procedure on the next entry into kernel. The
synchronization is not skipped, it simply happens later, early in
kernel entry code.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 22:39     ` [EXT] " Alex Belits
@ 2020-11-23 23:21       ` Frederic Weisbecker
  2020-11-25  3:20         ` Alex Belits
  2021-01-22 15:00         ` Marcelo Tosatti
  0 siblings, 2 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2020-11-23 23:21 UTC (permalink / raw)
  To: Alex Belits
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev

On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> 
> On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> > External Email
> > 
> > -------------------------------------------------------------------
> > ---
> > On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > > From: Yuri Norov <ynorov@marvell.com>
> > > 
> > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > running
> > > isolated tasks.
> > > 
> > > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > > [abelits@marvell.com: use safe task_isolation_cpumask()
> > > implementation]
> > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > ---
> > >  kernel/smp.c | 14 +++++++++++++-
> > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > index 4d17501433be..b2faecf58ed0 100644
> > > --- a/kernel/smp.c
> > > +++ b/kernel/smp.c
> > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> > >   */
> > >  void kick_all_cpus_sync(void)
> > >  {
> > > +	struct cpumask mask;
> > > +
> > >  	/* Make sure the change is visible before we kick the cpus */
> > >  	smp_mb();
> > > -	smp_call_function(do_nothing, NULL, 1);
> > > +
> > > +	preempt_disable();
> > > +#ifdef CONFIG_TASK_ISOLATION
> > > +	cpumask_clear(&mask);
> > > +	task_isolation_cpumask(&mask);
> > > +	cpumask_complement(&mask, &mask);
> > > +#else
> > > +	cpumask_setall(&mask);
> > > +#endif
> > > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > +	preempt_enable();
> > 
> > Same comment about IPIs here.
> 
> This is different from timers. The original design was based on the
> idea that every CPU should be able to enter kernel at any time and run
> kernel code with no additional preparation. Then the only solution is
> to always do full broadcast and require all CPUs to process it.
> 
> What I am trying to introduce is the idea of CPU that is not likely to
> run kernel code any soon, and can afford to go through an additional
> synchronization procedure on the next entry into kernel. The
> synchronization is not skipped, it simply happens later, early in
> kernel entry code.

Ah I see, this is ordered that way:

ll_isol_flags = ISOLATED

         CPU 0                                CPU 1
    ------------------                       -----------------
                                            // kernel entry
    data_to_sync = 1                        ll_isol_flags = ISOLATED_BROKEN
    smp_mb()                                smp_mb()
    if ll_isol_flags(CPU 1) == ISOLATED     READ data_to_sync
         smp_call(CPU 1)

You should document that, ie: explain why what you're doing is safe.

Also Beware though that the data to sync in question doesn't need to be visible
in the entry code before task_isolation_kernel_enter(). You need to audit all
the callers of kick_all_cpus_sync().

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (8 preceding siblings ...)
  2020-11-23 17:58 ` [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-11-24 16:36 ` Tom Rix
  2020-11-24 17:40   ` [EXT] " Alex Belits
  2020-12-05 20:40 ` Pavel Machek
  10 siblings, 1 reply; 37+ messages in thread
From: Tom Rix @ 2020-11-24 16:36 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, mingo, catalin.marinas, rostedt,
	linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz,
	leon, linux-arm-kernel, pauld, netdev


On 11/23/20 9:42 AM, Alex Belits wrote:
> This is an update of task isolation work that was originally done by
> Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> November 2017. It is adapted to the current kernel and cleaned up to
> implement its functionality in a more complete and cleaner manner.

I am having problems applying the patchset to today's linux-next.

Which kernel should I be using ?

Thanks,

Tom


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-24 16:36 ` [PATCH v5 0/9] "Task_isolation" mode Tom Rix
@ 2020-11-24 17:40   ` Alex Belits
  2020-12-02 14:02     ` Mark Rutland
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-11-24 17:40 UTC (permalink / raw)
  To: trix
  Cc: Prasun Kapoor, linux-api, davem, mingo, catalin.marinas, rostedt,
	linux-kernel, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, frederic, leon, linux-arm-kernel, pauld, netdev


On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> 
> On 11/23/20 9:42 AM, Alex Belits wrote:
> > This is an update of task isolation work that was originally done
> > by
> > Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > November 2017. It is adapted to the current kernel and cleaned up
> > to
> > implement its functionality in a more complete and cleaner manner.
> 
> I am having problems applying the patchset to today's linux-next.
> 
> Which kernel should I be using ?

The patches are against Linus' tree, in particular, commit
a349e4c659609fd20e4beea89e5c4a4038e33a95

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 23:21       ` Frederic Weisbecker
@ 2020-11-25  3:20         ` Alex Belits
  2021-01-22 15:00         ` Marcelo Tosatti
  1 sibling, 0 replies; 37+ messages in thread
From: Alex Belits @ 2020-11-25  3:20 UTC (permalink / raw)
  To: frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev


On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and
> > run
> > kernel code with no additional preparation. Then the only solution
> > is
> > to always do full broadcast and require all CPUs to process it.
> > 
> > What I am trying to introduce is the idea of CPU that is not likely
> > to
> > run kernel code any soon, and can afford to go through an
> > additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.
> 
> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>          CPU 0                                CPU 1
>     ------------------                       -----------------
>                                             // kernel entry
>     data_to_sync = 1                        ll_isol_flags = ISOLATED_BROKEN
>     smp_mb()                                smp_mb()
>     if ll_isol_flags(CPU 1) == ISOLATED     READ data_to_sync
>          smp_call(CPU 1)
> 

The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following
can happen:

1. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens before the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
                                            // kernel entry
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
    data_to_sync = 1                              smp_mb()
                                                  // data_to_sync undetermined
    smp_mb()                                }
    // ll_isol_flags(CPU 1) updated
    if ll_isol_flags(CPU 1) != ISOLATED
                                            // interrupts enabled
         smp_call(CPU 1)                          // kernel entry again
                                                  if (ll_isol_flags == ISOLATED)
                                                        // nothing happens
                                                  // explicit or implied barriers
                                                  // data_to_sync updated
                                                  // kernel exit
    // CPU 0 assumes, CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens after the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
    data_to_sync = 1                        // kernel entry
    smp_mb()                                // data_to_sync undetermined
                                            // should not access data_to_sync here
                                            if (ll_isol_flags == ISOLATED) {      
                                                   ll_isol_flags = ISOLATED_BROKEN
    // ll_isol_flags(CPU 1) undetermined           smp_mb()
                                                   // data_to_sync updated
    if ll_isol_flags(CPU 1) != ISOLATED     }
         // possibly nothing happens
    // CPU 0 assumes, CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on the sender
side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
                                            // kernel entry
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
                                                  smp_mb()
                                            }
                                            // interrupts are enabled at some
    data_to_sync = 1                        // point here, data_to_sync value
    smp_mb()                                // is undetermined, CPU 0 makes no
    // ll_isol_flags(CPU 1) updated         // assumptions about it
    if ll_isol_flags(CPU 1) != ISOLATED     //
          smp_call(CPU 1)                         // kernel entry again
                                                  if (ll_isol_flags == ISOLATED)
                                                        // nothing happens
                                                  // explicit or implied barriers
                                                  // data_to_sync updated
                                                  // kernel exit
    // CPU 0 assumes, CPU 1 will see
    // data_to_sync = 1 when in kernel
                                            READ data_to_sync

4. Kernel entry processing happens after the check for isolation on the sender
side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                      -----------------
    data_to_sync = 1                        // userspace or early kernel entry
    smp_mb()
    if ll_isol_flags(CPU 1) != ISOLATED
          smp_call(CPU 1) // skipped        // userspace or early kernel entry
    // CPU 0 assumes, CPU 1 will see        // continues undisturbed
    // data_to_sync = 1 when in kernel
                                            // kernel entry
                                            // data_to_sync undetermined
                                            // should not access data_to_sync here
                                            if (ll_isol_flags == ISOLATED) {
                                                  ll_isol_flags = ISOLATED_BROKEN
                                                  smp_mb()
                                                  // data_to_sync updated
                                            }
                                            READ data_to_sync

This also applies to exit to userspace after enabling isolation -- once
ll_isol_flags is set to ISOLATED, synchronization will be missed, so
one final barrier should happen before returning to userspace and
enabling interrupts in the process. In this case "unlucky" timing would
result in smp call interrupting userspace that is already supposed to
be isolated, it will trigger normal isolation breaking procedure but
otherwise will be an unremarkable synchronization call. On the other
hand, synchronization that was supposed to happen after setting
ll_isol_flags, will be delayed to the next kernel entry, so there
should be nothing that needs synchronization between the end of
task_isolation_exit_to_user_mode() and entering userspace (interrupts
are ok, though, they will trigger isolation breaking).

This is why I have placed task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode() in entry/exit code, and when I
have seen any ambiguity or had any doubt allowed duplicate calls to
task_isolation_kernel_enter() on entry. If I overdid any of that, I
would appreciate fixes and clarifications, however it should be taken
into account that for now different architectures and even drivers may
call common and specific functions in a slightly different order.

Synchronization also applies to possible effects on pipeline (when
originating CPU writes instructions).

There is a version under development that delays TLB flushes in this
manner. On arm64 that requires a switch to soft TLB flushes, and that's
another potential ball of wax. On architectures that already use soft
TLB flushes, this will be unavoidable because TLB flush becomes another
IPI, and IPIs break isolation. Maybe it will be better to make a
somewhat smarter TLB handling in general, so it will be possible to
avoid bothering unrelated CPUs. But then, I guess, hardware may still
overdo it.Then it will be closer to the situation with timers. For now,
I want to first do the obvious and exclude isolated tasks from TLB
updates until they are back in kernel, and do a slow full flush on
isolation breaking.

> You should document that, ie: explain why what you're doing is safe.

I tried to do that in comments, however this is clearly insufficient
despite their verbosity. Would it make sense to create a separate
documentation text file? Current documentation seems to be light on
detailed specifications for things under heavy development except for
something very significant that affects nearly everyone.

Would it make sense to include all scenarios like the above plus exit
to userspace, and existing sequences of calls that should lead to
synchronization calls, task_isolation_kernel_enter() or
task_isolation_exit_to_user_mode()?


> Also Beware though that the data to sync in question doesn't need to
> be visible
> in the entry code before task_isolation_kernel_enter().

Right. And after task_isolation_exit_to_user_mode() as well. This is
why I had to dig through entry/exit code. I can produce a somewhat
usable call sequence diagram.

If by any chance I am wrong there somewhere, I welcome all relevant
comments and corrections.

At this point I am not sure only about things that are called when
CONFIG_TRACE_IRQFLAGS is enabled, simply because I have not yet checked
what they depend on, and virtualization-specific entry/exit is not
entirely clear to me. Maybe for correctness sake I should have declared
task isolation incompatible with those until I either know for sure or
added working support for them.

>  You need to audit all
> the callers of kick_all_cpus_sync().

Right now it's just three places.

do_tune_cpucache() does the right thing, keeps CPUs from seeing old
values of cachep->cpu_cache as they are being deallocated.

powerpc kvm_arch_destroy_vm() cares only about CPUs with VCPUs on them,
what for now should not be used with isolation.

arm64 flush_icache_range() is the main reason why this could be
potentially problematic, it makes sure that other CPUs will not run
stale instructions, and it's called from all places where code is
modified. If in other situation some kind of delayed processing could
be possible, this one has to be done in the right sequence. And that's
the reason for instr_sync() after smp_mb() in
task_isolation_kernel_enter(). Since flush_icache_range() could skip
this CPU, we have to synchronize it by ourselves.

There is an important issue of not having early kernel entry / late
kernel exit rely on something that may be stale (with both data and
code). With the above mentioned exceptions for now, it can be
demonstrated that this is correct. It should be taken into account that
even though static keys are used in early kernel entry code, static
keys setting does not synchronize flushes, and therefore already should
be done before use.

Task isolation in virtualization guests can be a perfectly valid thing
to support (it just has to be propagated to the host if permitted),
however this is something I will want to revisit in the future. For
now, I assume that virtualization-related events are not supposed to
break isolation, however it would be nice to be ready for handling that
possibility.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
  2020-11-23 17:58 ` [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality Alex Belits
@ 2020-12-02 13:59   ` Mark Rutland
  2020-12-04  0:37     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Rutland @ 2020-12-02 13:59 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

Hi Alex,

On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:
> In do_notify_resume(), call task_isolation_before_pending_work_check()
> first, to report isolation breaking, then after handling all pending
> work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.
> 
> Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
> define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
> don't clear _TIF_TASK_ISOLATION in the loop.
> 
> Early kernel entry code calls task_isolation_kernel_enter(). In
> particular:
> 
> Vectors:
> el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
> el1_irq -> asm_nmi_enter(), handle_arch_irq()
> el1_error -> do_serror()
> el0_sync -> el0_sync_handler()
> el0_irq -> handle_arch_irq()
> el0_error -> do_serror()
> el0_sync_compat -> el0_sync_compat_handler()
> el0_irq_compat -> handle_arch_irq()
> el0_error_compat -> do_serror()
> 
> SDEI entry:
> __sdei_asm_handler -> __sdei_handler() -> nmi_enter()

As a heads-up, the arm64 entry code is changing, as we found that our
lockdep, RCU, and context-tracking management wasn't quite right. I have
a series of patches:

  https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com

... which are queued in the arm64 for-next/fixes branch. I intend to
have some further rework ready for the next cycle. I'd appreciate if you
could Cc me on any patches altering the arm64 entry code, as I have a
vested interest.

That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
chosen and context tracking was in use (e.g. with
CONTEXT_TRACKING_FORCE), so I'm assuming that this series has not been
tested in that configuration. What sort of testing has this seen?

It would be very helpful for the next posting if you could provide any
instructions on how to test this series (e.g. with pointers to any test
suite that you have), since it's very easy to introduce subtle breakage
in this area without realising it.

> 
> Functions called from there:
> asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> 
> Handlers:
> do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
>   or task_isolation_kernel_enter()
> el1_sync_handler() -> task_isolation_kernel_enter()
> el0_sync_handler() -> task_isolation_kernel_enter()
> el0_sync_compat_handler() -> task_isolation_kernel_enter()
> 
> handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
> There is a separate patch for irqchips that do not follow this rule.
> 
> handle_domain_irq() -> task_isolation_kernel_enter()
> do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> nmi_enter() -> task_isolation_kernel_enter()

The IRQ cases look very odd to me. With the rework I've just done for
arm64, we'll do the regular context tracking accounting before we ever
get into handle_domain_irq() or similar, so I suspect that's not
necessary at all?

> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> [abelits@marvell.com: simplified to match kernel 5.10]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/barrier.h     |  1 +
>  arch/arm64/include/asm/thread_info.h |  7 +++++--
>  arch/arm64/kernel/entry-common.c     |  7 +++++++
>  arch/arm64/kernel/ptrace.c           | 10 ++++++++++
>  arch/arm64/kernel/signal.c           | 13 ++++++++++++-
>  arch/arm64/kernel/smp.c              |  3 +++
>  7 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1515f6f153a0..fc958d8d8945 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -141,6 +141,7 @@ config ARM64
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
>  	select HAVE_ARCH_STACKLEAK
> +	select HAVE_ARCH_TASK_ISOLATION
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index c3009b0e5239..ad5a6dd380cf 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -49,6 +49,7 @@
>  #define dma_rmb()	dmb(oshld)
>  #define dma_wmb()	dmb(oshst)
>  
> +#define instr_sync()	isb()

I think I've asked on prior versions of the patchset, but what is this
for? Where is it going to be used, and what is the expected semantics?
I'm wary of exposing this outside of arch code because there aren't
strong cross-architectural semantics, and at the least this requires
some documentation.

If it's unused, please delete it.

[...]

> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 43d4c329775f..8152760de683 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -8,6 +8,7 @@
>  #include <linux/context_tracking.h>
>  #include <linux/ptrace.h>
>  #include <linux/thread_info.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/cpufeature.h>
>  #include <asm/daifflags.h>
> @@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
>  
> +	task_isolation_kernel_enter();

For regular context tracking we only acount the user<->kernel
transitions.

This is a kernel->kernel transition, so surely this is not necessary?

If nothing else, it doesn't feel well-balanced.

I havwe not looked at the rest of this patch (or series) in detail.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-24 17:40   ` [EXT] " Alex Belits
@ 2020-12-02 14:02     ` Mark Rutland
  2020-12-04  0:39       ` Alex Belits
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Rutland @ 2020-12-02 14:02 UTC (permalink / raw)
  To: Alex Belits
  Cc: trix, Prasun Kapoor, linux-api, davem, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, nitesh, linux-arch,
	mtosatti, will, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev

On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> 
> On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> > External Email
> > 
> > -------------------------------------------------------------------
> > ---
> > 
> > On 11/23/20 9:42 AM, Alex Belits wrote:
> > > This is an update of task isolation work that was originally done
> > > by
> > > Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > > November 2017. It is adapted to the current kernel and cleaned up
> > > to
> > > implement its functionality in a more complete and cleaner manner.
> > 
> > I am having problems applying the patchset to today's linux-next.
> > 
> > Which kernel should I be using ?
> 
> The patches are against Linus' tree, in particular, commit
> a349e4c659609fd20e4beea89e5c4a4038e33a95

Is there any reason to base on that commit in particular?

Generally it's preferred that a series is based on a tag (so either a
release or an -rc kernel), and that the cover letter explains what the
base is. If you can do that in future it'll make the series much easier
to work with.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks
  2020-11-23 17:57 ` [PATCH v5 5/9] task_isolation: Add driver-specific hooks Alex Belits
@ 2020-12-02 14:18   ` Mark Rutland
  2020-12-04  0:43     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Rutland @ 2020-12-02 14:18 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote:
> Some drivers don't call functions that call
> task_isolation_kernel_enter() in interrupt handlers. Call it
> directly.

I don't think putting this in drivers is the right approach. IIUC we
only need to track user<->kernel transitions, and we can do that within
the architectural entry code before we ever reach irqchip code. I
suspect the current approacch is an artifact of that being difficult in
the old structure of the arch code; recent rework should address that,
and we can restruecture things further in future.

Thanks,
Mark.

> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  drivers/irqchip/irq-armada-370-xp.c | 6 ++++++
>  drivers/irqchip/irq-gic-v3.c        | 3 +++
>  drivers/irqchip/irq-gic.c           | 3 +++
>  drivers/s390/cio/cio.c              | 3 +++
>  4 files changed, 15 insertions(+)
> 
> diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
> index d7eb2e93db8f..4ac7babe1abe 100644
> --- a/drivers/irqchip/irq-armada-370-xp.c
> +++ b/drivers/irqchip/irq-armada-370-xp.c
> @@ -29,6 +29,7 @@
>  #include <linux/slab.h>
>  #include <linux/syscore_ops.h>
>  #include <linux/msi.h>
> +#include <linux/isolation.h>
>  #include <asm/mach/arch.h>
>  #include <asm/exception.h>
>  #include <asm/smp_plat.h>
> @@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
>  static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
>  {
>  	u32 msimask, msinr;
> +	int isol_entered = 0;
>  
>  	msimask = readl_relaxed(per_cpu_int_base +
>  				ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
> @@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
>  			continue;
>  
>  		if (is_chained) {
> +			if (!isol_entered) {
> +				task_isolation_kernel_enter();
> +				isol_entered = 1;
> +			}
>  			irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
>  					       msinr - PCI_MSI_DOORBELL_START);
>  			generic_handle_irq(irq);
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 16fecc0febe8..ded26dd4da0f 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -18,6 +18,7 @@
>  #include <linux/percpu.h>
>  #include <linux/refcount.h>
>  #include <linux/slab.h>
> +#include <linux/isolation.h>
>  
>  #include <linux/irqchip.h>
>  #include <linux/irqchip/arm-gic-common.h>
> @@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
>  {
>  	u32 irqnr;
>  
> +	task_isolation_kernel_enter();
> +
>  	irqnr = gic_read_iar();
>  
>  	if (gic_supports_nmi() &&
> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
> index 6053245a4754..bb482b4ae218 100644
> --- a/drivers/irqchip/irq-gic.c
> +++ b/drivers/irqchip/irq-gic.c
> @@ -35,6 +35,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/percpu.h>
>  #include <linux/slab.h>
> +#include <linux/isolation.h>
>  #include <linux/irqchip.h>
>  #include <linux/irqchip/chained_irq.h>
>  #include <linux/irqchip/arm-gic.h>
> @@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
>  	struct gic_chip_data *gic = &gic_data[0];
>  	void __iomem *cpu_base = gic_data_cpu_base(gic);
>  
> +	task_isolation_kernel_enter();
> +
>  	do {
>  		irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
>  		irqnr = irqstat & GICC_IAR_INT_ID_MASK;
> diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
> index 6d716db2a46a..beab88881b6d 100644
> --- a/drivers/s390/cio/cio.c
> +++ b/drivers/s390/cio/cio.c
> @@ -20,6 +20,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> +#include <linux/isolation.h>
>  #include <asm/cio.h>
>  #include <asm/delay.h>
>  #include <asm/irq.h>
> @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
>  	struct irb *irb;
>  	int irq_context;
>  
> +	task_isolation_kernel_enter();
> +
>  	irb = this_cpu_ptr(&cio_irb);
>  	/* Store interrupt response block to lowcore. */
>  	if (tsch(sch->schid, irb) != 0)
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
  2020-11-23 22:13   ` Frederic Weisbecker
  2020-11-23 22:36   ` Thomas Gleixner
@ 2020-12-02 14:20   ` Mark Rutland
  2020-12-04  0:54     ` [EXT] " Alex Belits
  2 siblings, 1 reply; 37+ messages in thread
From: Mark Rutland @ 2020-12-02 14:20 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds check for it.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a213952541db..6c8679e200f0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/stat.h>
>  #include <linux/sched/nohz.h>
> +#include <linux/isolation.h>
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

What does this barrier pair with? The commit message doesn't mention it,
and it's not clear in-context.

Thanks,
Mark.

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;
>  
>  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
  2020-12-02 13:59   ` Mark Rutland
@ 2020-12-04  0:37     ` Alex Belits
  2020-12-07 11:57       ` Mark Rutland
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-12-04  0:37 UTC (permalink / raw)
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev


On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> Hi Alex,
> 
> On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:
> > In do_notify_resume(), call
> > task_isolation_before_pending_work_check()
> > first, to report isolation breaking, then after handling all
> > pending
> > work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.
> > 
> > Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
> > define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since
> > we
> > don't clear _TIF_TASK_ISOLATION in the loop.
> > 
> > Early kernel entry code calls task_isolation_kernel_enter(). In
> > particular:
> > 
> > Vectors:
> > el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
> > el1_irq -> asm_nmi_enter(), handle_arch_irq()
> > el1_error -> do_serror()
> > el0_sync -> el0_sync_handler()
> > el0_irq -> handle_arch_irq()
> > el0_error -> do_serror()
> > el0_sync_compat -> el0_sync_compat_handler()
> > el0_irq_compat -> handle_arch_irq()
> > el0_error_compat -> do_serror()
> > 
> > SDEI entry:
> > __sdei_asm_handler -> __sdei_handler() -> nmi_enter()
> 
> As a heads-up, the arm64 entry code is changing, as we found that our
> lockdep, RCU, and context-tracking management wasn't quite right. I
> have
> a series of patches:
> 
> https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com
> 
> ... which are queued in the arm64 for-next/fixes branch. I intend to
> have some further rework ready for the next cycle.

Thanks!

>  I'd appreciate if you
> could Cc me on any patches altering the arm64 entry code, as I have a
> vested interest.

I will do that.

> 
> That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> chosen and context tracking was in use (e.g. with
> CONTEXT_TRACKING_FORCE),

I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
CONTEXT_TRACKING have to be enabled for it to do anything.

I will check it with PROVE_LOCKING and your patches.

Entry code only adds an inline function that, if task isolation is
enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level 
operations and accesses per-CPU variabled by offset, so at very least
it should not add any problems. Even raw_local_irq_save() /
raw_local_irq_restore() probably should be removed, however I wanted to
have something that can be safely called if by whatever reason
interrupts were enabled before kernel was fully entered.

>  so I'm assuming that this series has not been
> tested in that configuration. What sort of testing has this seen?
> 

On various available arm64 hardware, with enabled

CONFIG_TASK_ISOLATION
CONFIG_NO_HZ_FULL
CONFIG_HIGH_RES_TIMERS

and disabled:

CONFIG_HZ_PERIODIC
CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ

> It would be very helpful for the next posting if you could provide
> any
> instructions on how to test this series (e.g. with pointers to any
> test
> suite that you have), since it's very easy to introduce subtle
> breakage
> in this area without realising it.

I will. Currently libtmc ( https://github.com/abelits/libtmc ) contains
all userspace code used for testing, however I should document the
testing procedures.

> 
> > Functions called from there:
> > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> > 
> > Handlers:
> > do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
> >   or task_isolation_kernel_enter()
> > el1_sync_handler() -> task_isolation_kernel_enter()
> > el0_sync_handler() -> task_isolation_kernel_enter()
> > el0_sync_compat_handler() -> task_isolation_kernel_enter()
> > 
> > handle_arch_irq() is irqchip-specific, most call
> > handle_domain_irq()
> > There is a separate patch for irqchips that do not follow this
> > rule.
> > 
> > handle_domain_irq() -> task_isolation_kernel_enter()
> > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> > nmi_enter() -> task_isolation_kernel_enter()
> 
> The IRQ cases look very odd to me. With the rework I've just done for
> arm64, we'll do the regular context tracking accounting before we
> ever
> get into handle_domain_irq() or similar, so I suspect that's not
> necessary at all?

The goal is to call task_isolation_kernel_enter() before anything that
depends on a CPU state, including pipeline, that could remain un-
synchronized when the rest of the kernel was sending synchronization
IPIs. Similarly task_isolation_kernel_return() should be called when it
is safe to turn off synchronization. If rework allows it to be done
earlier, there is no need to touch more specific functions.

> --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -49,6 +49,7 @@
> >  #define dma_rmb()	dmb(oshld)
> >  #define dma_wmb()	dmb(oshst)
> >  
> > +#define instr_sync()	isb()
> 
> I think I've asked on prior versions of the patchset, but what is
> this
> for? Where is it going to be used, and what is the expected
> semantics?
> I'm wary of exposing this outside of arch code because there aren't
> strong cross-architectural semantics, and at the least this requires
> some documentation.

This is intended as an instruction pipeline flush for the situation
when arch-independent code has to synchronize with potential changes
that it missed. This is necessary after some other CPUs could modify
code (and send IPIs to notify the rest but not isolated CPU) while this
one was still running isolated task or, more likely, exiting from it,
so it might be unlucky enough to pick the old instructions before that
point.

It's only used on kernel entry.

> 
> If it's unused, please delete it.
> 
> [...]
> 
> > diff --git a/arch/arm64/kernel/entry-common.c
> > b/arch/arm64/kernel/entry-common.c
> > index 43d4c329775f..8152760de683 100644
> > --- a/arch/arm64/kernel/entry-common.c
> > +++ b/arch/arm64/kernel/entry-common.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/context_tracking.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/thread_info.h>
> > +#include <linux/isolation.h>
> >  
> >  #include <asm/cpufeature.h>
> >  #include <asm/daifflags.h>
> > @@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct
> > pt_regs *regs)
> >  {
> >  	unsigned long esr = read_sysreg(esr_el1);
> >  
> > +	task_isolation_kernel_enter();
> 
> For regular context tracking we only acount the user<->kernel
> transitions.
> 
> This is a kernel->kernel transition, so surely this is not necessary?

Right. If we entered kernel from an isolated task, we have already
changed the flags.

> 
> If nothing else, it doesn't feel well-balanced.
> 
> I havwe not looked at the rest of this patch (or series) in detail.
> 
> Thanks,
> Mark.

My goal was to make sure that all transitions between kernel and
userspace are covered, so when in doubt I have added corresponding
calls those inline functions, and made them safe to be called from
those places. With improved entry-exit code it should be easier to be
sure where this can be done in a cleaner way.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-12-02 14:02     ` Mark Rutland
@ 2020-12-04  0:39       ` Alex Belits
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Belits @ 2020-12-04  0:39 UTC (permalink / raw)
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, frederic, leon, linux-arm-kernel, catalin.marinas, pauld,
	netdev


On Wed, 2020-12-02 at 14:02 +0000, Mark Rutland wrote:
> On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> > 
> > > I am having problems applying the patchset to today's linux-next.
> > > 
> > > Which kernel should I be using ?
> > 
> > The patches are against Linus' tree, in particular, commit
> > a349e4c659609fd20e4beea89e5c4a4038e33a95
> 
> Is there any reason to base on that commit in particular?

No specific reason for that particular commit.

> Generally it's preferred that a series is based on a tag (so either a
> release or an -rc kernel), and that the cover letter explains what
> the
> base is. If you can do that in future it'll make the series much
> easier
> to work with.

Ok.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks
  2020-12-02 14:18   ` Mark Rutland
@ 2020-12-04  0:43     ` Alex Belits
  0 siblings, 0 replies; 37+ messages in thread
From: Alex Belits @ 2020-12-04  0:43 UTC (permalink / raw)
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev


On Wed, 2020-12-02 at 14:18 +0000, Mark Rutland wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote:
> > Some drivers don't call functions that call
> > task_isolation_kernel_enter() in interrupt handlers. Call it
> > directly.
> 
> I don't think putting this in drivers is the right approach. IIUC we
> only need to track user<->kernel transitions, and we can do that
> within
> the architectural entry code before we ever reach irqchip code. I
> suspect the current approacch is an artifact of that being difficult
> in
> the old structure of the arch code; recent rework should address
> that,
> and we can restruecture things further in future.

I agree completely. This patch only covers irqchip drivers with unusual
entry procedures.

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-12-02 14:20   ` Mark Rutland
@ 2020-12-04  0:54     ` Alex Belits
  2020-12-07 11:58       ` Mark Rutland
  0 siblings, 1 reply; 37+ messages in thread
From: Alex Belits @ 2020-12-04  0:54 UTC (permalink / raw)
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev


On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > [abelits@marvell.com: updated, only exclude CPUs running isolated
> > tasks]
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/time/tick-sched.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index a213952541db..6c8679e200f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/stat.h>
> >  #include <linux/sched/nohz.h>
> > +#include <linux/isolation.h>
> >  #include <linux/module.h>
> >  #include <linux/irq_work.h>
> >  #include <linux/posix-timers.h>
> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
> 
> What does this barrier pair with? The commit message doesn't mention
> it,
> and it's not clear in-context.

With barriers in task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode().

-- 
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
                   ` (9 preceding siblings ...)
  2020-11-24 16:36 ` [PATCH v5 0/9] "Task_isolation" mode Tom Rix
@ 2020-12-05 20:40 ` Pavel Machek
  2020-12-05 23:25   ` Thomas Gleixner
  10 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2020-12-05 20:40 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

Hi!

> General description
> 
> This is the result of development and maintenance of task isolation
> functionality that originally started based on task isolation patch
> v15 and was later updated to include v16. It provided predictable
> environment for userspace tasks running on arm64 processors alongside
> with full-featured Linux environment. It is intended to provide
> reliable interruption-free environment from the point when a userspace
> task enters isolation and until the moment it leaves isolation or
> receives a signal intentionally sent to it, and was successfully used
> for this purpose. While CPU isolation with nohz provides an
> environment that is close to this requirement, the remaining IPIs and
> other disturbances keep it from being usable for tasks that require
> complete predictability of CPU timing.

So... what kind of guarantees does this aim to provide / what tasks it
is useful for?

For real time response, we have other approaches.

If you want to guarantee performnace of the "isolated" task... I don't
see how that works. Other tasks on the system still compete for DRAM
bandwidth, caches, etc...

So... what is the usecase?
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-12-05 20:40 ` Pavel Machek
@ 2020-12-05 23:25   ` Thomas Gleixner
  2020-12-11 18:08     ` Yury Norov
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2020-12-05 23:25 UTC (permalink / raw)
  To: Pavel Machek, Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

Pavel,

On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> So... what kind of guarantees does this aim to provide / what tasks it
> is useful for?
>
> For real time response, we have other approaches.

Depends on your requirements. Some problems are actually better solved
with busy polling. See below.

> If you want to guarantee performnace of the "isolated" task... I don't
> see how that works. Other tasks on the system still compete for DRAM
> bandwidth, caches, etc...

Applications which want to run as undisturbed as possible. There is
quite a range of those:

  - Hardware in the loop simulation is today often done with that crude
    approach of "offlining" a CPU and then instead of playing dead
    jumping to a preloaded bare metal executable. That's a horrible hack
    and impossible to debug, but gives them the results they need to
    achieve. These applications are well optimized vs. cache and memory
    foot print, so they don't worry about these things too much and they
    surely don't run on SMI and BIOS value add inflicted machines.

    Don't even think about waiting for an interrupt to achieve what
    these folks are doing. So no, there are problems which a general
    purpose realtime OS cannot solve ever.

  - HPC computations on large data sets. While the memory foot print is
    large the access patterns are cache optimized. 

    The problem there is that any unnecessary IPI, tick interrupt or
    whatever nuisance is disturbing the carefully optimized cache usage
    and alone getting rid of the timer interrupt gained them measurable
    performance. Even very low single digit percentage of runtime saving
    is valuable for these folks because the compute time on such beasts
    is expensive.

  - Realtime guests in KVM. With posted interrupts and a fully populated
    host side page table there is no point in running host side
    interrupts or IPIs for random accounting or whatever purposes as
    they affect the latency in the guest. With all the side effects
    mitigated and a properly set up guest and host it is possible to get
    to a zero exit situation after the bootup phase which means pretty
    much matching bare metal behaviour.

    Yes, you can do that with e.g. Jailhouse as well, but you lose lots
    of the fancy things KVM provides. And people care about these not
    just because they are fancy. They care because their application
    scenario needs them.

There are more reasons why people want to be able to get as much
isolation from the OS as possible but at the same time have a sane
execution environment, debugging, performance monitoring and the OS
provided protection mechanisms instead of horrible hacks.

Isolation makes sense for a range of applications and there is no reason
why Linux should not support them. 

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
  2020-12-04  0:37     ` [EXT] " Alex Belits
@ 2020-12-07 11:57       ` Mark Rutland
  0 siblings, 0 replies; 37+ messages in thread
From: Mark Rutland @ 2020-12-07 11:57 UTC (permalink / raw)
  To: Alex Belits
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev

On Fri, Dec 04, 2020 at 12:37:32AM +0000, Alex Belits wrote:
> On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> > On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:

> > As a heads-up, the arm64 entry code is changing, as we found that
> > our lockdep, RCU, and context-tracking management wasn't quite
> > right. I have a series of patches:
> > 
> > https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com
> > 
> > ... which are queued in the arm64 for-next/fixes branch. I intend to
> > have some further rework ready for the next cycle.

> > That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> > chosen and context tracking was in use (e.g. with
> > CONTEXT_TRACKING_FORCE),
> 
> I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
> CONTEXT_TRACKING have to be enabled for it to do anything.
> 
> I will check it with PROVE_LOCKING and your patches.
	
Thanks. In future, please do test this functionality with PROVE_LOCKING,
because otherwise bugs with RCU and IRQ state maangement will easily be
missed (as has been the case until very recently).

Testing with all those debug optiosn enabled (and stating that you have
done so) will give reviuewers much greater confidence that this works,
and if that does start spewing errors it save everyone the time
identifying that.

> Entry code only adds an inline function that, if task isolation is
> enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level 
> operations and accesses per-CPU variabled by offset, so at very least
> it should not add any problems. Even raw_local_irq_save() /
> raw_local_irq_restore() probably should be removed, however I wanted to
> have something that can be safely called if by whatever reason
> interrupts were enabled before kernel was fully entered.

Sure. In the new flows we have new enter_from_*() and exit_to_*()
functions where these calls should be able to live (and so we should be
able to ensure a more consistent environment).

The near-term plan for arm64 is to migrate more of the exception triage
assembly to C, then to rework the arm64 entry code and generic entry
code to be more similar, then to migrate as much as possible to the
generic entry code. So please bear in mind that anything that adds to
the differences between the two is goingf to be problematic.

> >  so I'm assuming that this series has not been
> > tested in that configuration. What sort of testing has this seen?
> 
> On various available arm64 hardware, with enabled
> 
> CONFIG_TASK_ISOLATION
> CONFIG_NO_HZ_FULL
> CONFIG_HIGH_RES_TIMERS
> 
> and disabled:
> 
> CONFIG_HZ_PERIODIC
> CONFIG_NO_HZ_IDLE
> CONFIG_NO_HZ

Ok. I'd recommend looking at the various debug options under the "kernel
hacking" section in kconfig, and enabling some of those. At the very
least PROVE_LOCKING, ideally also using the lockup dectors and anything
else for debugging RCU, etc.

[...]

> > > Functions called from there:
> > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> > > 
> > > Handlers:
> > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
> > >   or task_isolation_kernel_enter()
> > > el1_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_compat_handler() -> task_isolation_kernel_enter()
> > > 
> > > handle_arch_irq() is irqchip-specific, most call
> > > handle_domain_irq()
> > > There is a separate patch for irqchips that do not follow this
> > > rule.
> > > 
> > > handle_domain_irq() -> task_isolation_kernel_enter()
> > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> > > nmi_enter() -> task_isolation_kernel_enter()
> > 
> > The IRQ cases look very odd to me. With the rework I've just done
> > for arm64, we'll do the regular context tracking accounting before
> > we ever get into handle_domain_irq() or similar, so I suspect that's
> > not necessary at all?
> 
> The goal is to call task_isolation_kernel_enter() before anything that
> depends on a CPU state, including pipeline, that could remain un-
> synchronized when the rest of the kernel was sending synchronization
> IPIs. Similarly task_isolation_kernel_return() should be called when it
> is safe to turn off synchronization. If rework allows it to be done
> earlier, there is no need to touch more specific functions.

Sure; I think that's sorted as a result of the changes I made recently.

> 
> > --- a/arch/arm64/include/asm/barrier.h
> > > +++ b/arch/arm64/include/asm/barrier.h
> > > @@ -49,6 +49,7 @@
> > >  #define dma_rmb()	dmb(oshld)
> > >  #define dma_wmb()	dmb(oshst)
> > >  
> > > +#define instr_sync()	isb()
> > 
> > I think I've asked on prior versions of the patchset, but what is
> > this for? Where is it going to be used, and what is the expected
> > semantics?  I'm wary of exposing this outside of arch code because
> > there aren't strong cross-architectural semantics, and at the least
> > this requires some documentation.
> 
> This is intended as an instruction pipeline flush for the situation
> when arch-independent code has to synchronize with potential changes
> that it missed. This is necessary after some other CPUs could modify
> code (and send IPIs to notify the rest but not isolated CPU) while this
> one was still running isolated task or, more likely, exiting from it,
> so it might be unlucky enough to pick the old instructions before that
> point.
> 
> It's only used on kernel entry.

Sure. My point is that instr_sync() is a very generic sounding name
that doesn't get any of that across, and it's entirely undocumented.

I think something like arch_simulate_kick_cpu() would be better to get
the intended semantic across, and we should add thorough documentation
somewhere as to what this is meant to do.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-12-04  0:54     ` [EXT] " Alex Belits
@ 2020-12-07 11:58       ` Mark Rutland
  0 siblings, 0 replies; 37+ messages in thread
From: Mark Rutland @ 2020-12-07 11:58 UTC (permalink / raw)
  To: Alex Belits
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld,
	netdev

On Fri, Dec 04, 2020 at 12:54:29AM +0000, Alex Belits wrote:
> 
> On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> > External Email
> > 
> > -------------------------------------------------------------------
> > ---
> > On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
> > > From: Yuri Norov <ynorov@marvell.com>
> > > 
> > > For nohz_full CPUs the desirable behavior is to receive interrupts
> > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > > obviously not desirable because it breaks isolation.
> > > 
> > > This patch adds check for it.
> > > 
> > > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > > [abelits@marvell.com: updated, only exclude CPUs running isolated
> > > tasks]
> > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > ---
> > >  kernel/time/tick-sched.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index a213952541db..6c8679e200f0 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -20,6 +20,7 @@
> > >  #include <linux/sched/clock.h>
> > >  #include <linux/sched/stat.h>
> > >  #include <linux/sched/nohz.h>
> > > +#include <linux/isolation.h>
> > >  #include <linux/module.h>
> > >  #include <linux/irq_work.h>
> > >  #include <linux/posix-timers.h>
> > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	smp_rmb();
> > 
> > What does this barrier pair with? The commit message doesn't mention
> > it,
> > and it's not clear in-context.
> 
> With barriers in task_isolation_kernel_enter()
> and task_isolation_exit_to_user_mode().

Please add a comment in the code as to what it pairs with.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-12-05 23:25   ` Thomas Gleixner
@ 2020-12-11 18:08     ` Yury Norov
  0 siblings, 0 replies; 37+ messages in thread
From: Yury Norov @ 2020-12-11 18:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Pavel Machek, Alex Belits, nitesh, frederic, Prasun Kapoor,
	linux-api, davem, trix, mingo, catalin.marinas, rostedt,
	linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon,
	linux-arm-kernel, pauld, netdev

On Sun, Dec 06, 2020 at 12:25:45AM +0100, Thomas Gleixner wrote:
> Pavel,
> 
> On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> > So... what kind of guarantees does this aim to provide / what tasks it
> > is useful for?
> >
> > For real time response, we have other approaches.
> 
> Depends on your requirements. Some problems are actually better solved
> with busy polling. See below.
> 
> > If you want to guarantee performnace of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...
> 
> Applications which want to run as undisturbed as possible. There is
> quite a range of those:
> 
>   - Hardware in the loop simulation is today often done with that crude
>     approach of "offlining" a CPU and then instead of playing dead
>     jumping to a preloaded bare metal executable. That's a horrible hack
>     and impossible to debug, but gives them the results they need to
>     achieve. These applications are well optimized vs. cache and memory
>     foot print, so they don't worry about these things too much and they
>     surely don't run on SMI and BIOS value add inflicted machines.
> 
>     Don't even think about waiting for an interrupt to achieve what
>     these folks are doing. So no, there are problems which a general
>     purpose realtime OS cannot solve ever.
> 
>   - HPC computations on large data sets. While the memory foot print is
>     large the access patterns are cache optimized. 
> 
>     The problem there is that any unnecessary IPI, tick interrupt or
>     whatever nuisance is disturbing the carefully optimized cache usage
>     and alone getting rid of the timer interrupt gained them measurable
>     performance. Even very low single digit percentage of runtime saving
>     is valuable for these folks because the compute time on such beasts
>     is expensive.
> 
>   - Realtime guests in KVM. With posted interrupts and a fully populated
>     host side page table there is no point in running host side
>     interrupts or IPIs for random accounting or whatever purposes as
>     they affect the latency in the guest. With all the side effects
>     mitigated and a properly set up guest and host it is possible to get
>     to a zero exit situation after the bootup phase which means pretty
>     much matching bare metal behaviour.
> 
>     Yes, you can do that with e.g. Jailhouse as well, but you lose lots
>     of the fancy things KVM provides. And people care about these not
>     just because they are fancy. They care because their application
>     scenario needs them.
> 
> There are more reasons why people want to be able to get as much
> isolation from the OS as possible but at the same time have a sane
> execution environment, debugging, performance monitoring and the OS
> provided protection mechanisms instead of horrible hacks.
> 
> Isolation makes sense for a range of applications and there is no reason
> why Linux should not support them. 

One good client for the task isolation is Open Data Plane. There are
even some code stubs supposed to enable isolation where needed.

> > If you want to guarantee performnace of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...

My experiments say that typical delay caused by dry IPI or syscall is
2000-20000 'ticks'. Typical delay caused by cache miss is 3-30 ticks.

To guarantee cache / memory bandwidth, one can use resctrl. Linux has
implementation of it for x86 only, but arm64 has support for for
resctrl on CPU side.

Thanks,
Yury

> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 23:21       ` Frederic Weisbecker
  2020-11-25  3:20         ` Alex Belits
@ 2021-01-22 15:00         ` Marcelo Tosatti
  1 sibling, 0 replies; 37+ messages in thread
From: Marcelo Tosatti @ 2021-01-22 15:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alex Belits, Prasun Kapoor, linux-api, davem, trix, mingo,
	linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev

On Tue, Nov 24, 2020 at 12:21:06AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> > 
> > On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> > > External Email
> > > 
> > > -------------------------------------------------------------------
> > > ---
> > > On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > > > From: Yuri Norov <ynorov@marvell.com>
> > > > 
> > > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > > running
> > > > isolated tasks.
> > > > 
> > > > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > > > [abelits@marvell.com: use safe task_isolation_cpumask()
> > > > implementation]
> > > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > > ---
> > > >  kernel/smp.c | 14 +++++++++++++-
> > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > > index 4d17501433be..b2faecf58ed0 100644
> > > > --- a/kernel/smp.c
> > > > +++ b/kernel/smp.c
> > > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> > > >   */
> > > >  void kick_all_cpus_sync(void)
> > > >  {
> > > > +	struct cpumask mask;
> > > > +
> > > >  	/* Make sure the change is visible before we kick the cpus */
> > > >  	smp_mb();
> > > > -	smp_call_function(do_nothing, NULL, 1);
> > > > +
> > > > +	preempt_disable();
> > > > +#ifdef CONFIG_TASK_ISOLATION
> > > > +	cpumask_clear(&mask);
> > > > +	task_isolation_cpumask(&mask);
> > > > +	cpumask_complement(&mask, &mask);
> > > > +#else
> > > > +	cpumask_setall(&mask);
> > > > +#endif
> > > > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > > +	preempt_enable();
> > > 
> > > Same comment about IPIs here.
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and run
> > kernel code with no additional preparation. Then the only solution is
> > to always do full broadcast and require all CPUs to process it.
> > 
> > What I am trying to introduce is the idea of CPU that is not likely to
> > run kernel code any soon, and can afford to go through an additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.

Perhaps a bitmask of pending flushes makes more sense? 
static_key_enable IPIs is one of the users, but for its case it would 
be necessary to differentiate between in-kernel mode and out of kernel 
mode atomically (since i-cache flush must be performed if isolated CPU 
is in kernel mode).

> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>          CPU 0                                CPU 1
>     ------------------                       -----------------
>                                             // kernel entry
>     data_to_sync = 1                        ll_isol_flags = ISOLATED_BROKEN
>     smp_mb()                                smp_mb()
>     if ll_isol_flags(CPU 1) == ISOLATED     READ data_to_sync
>          smp_call(CPU 1)

Since isolated mode with syscalls is a desired feature, having a
separate atomic with in_kernel_mode = 0/1 (that is set/cleared 
on kernel entry / kernel exit, while on TIF_TASK_ISOLATION), would be
necessary (and a similar race-free logic as above).

> You should document that, ie: explain why what you're doing is safe.
> 
> Also Beware though that the data to sync in question doesn't need to be visible
> in the entry code before task_isolation_kernel_enter(). You need to audit all
> the callers of kick_all_cpus_sync().

Cscope tag: flush_icache_range
   #   line  filename / context / line
   1     96  arch/arc/kernel/jump_label.c <<arch_jump_label_transform>>
             flush_icache_range(entry->code, entry->code + JUMP_LABEL_NOP_SIZE);

This case would be OK for delayed processing before kernel entry, as long as
no code before task_isolation_kernel_enter can be modified (which i am
not sure about).

But:

  36     28  arch/ia64/include/asm/cacheflush.h <<flush_icache_user_page>>
             flush_icache_range(_addr, _addr + (len)); \

Is less certain.

Alex do you recall if arch_jump_label_transform was the only offender or 
there were others as well? (suppose handling only the ones which matter
in production at the moment, and later fixing individual ones makes most
sense).




^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-01-22 15:12 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-11-23 21:48   ` Thomas Gleixner
2020-11-23 17:56 ` [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-11-23 21:49   ` Thomas Gleixner
2020-11-23 17:56 ` [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel Alex Belits
2020-11-23 22:01   ` Thomas Gleixner
2020-11-23 17:57 ` [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-11-23 22:31   ` Thomas Gleixner
2020-11-23 17:57 ` [PATCH v5 5/9] task_isolation: Add driver-specific hooks Alex Belits
2020-12-02 14:18   ` Mark Rutland
2020-12-04  0:43     ` [EXT] " Alex Belits
2020-11-23 17:58 ` [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality Alex Belits
2020-12-02 13:59   ` Mark Rutland
2020-12-04  0:37     ` [EXT] " Alex Belits
2020-12-07 11:57       ` Mark Rutland
2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-11-23 22:13   ` Frederic Weisbecker
2020-11-23 22:35     ` Alex Belits
2020-11-23 22:36   ` Thomas Gleixner
2020-12-02 14:20   ` Mark Rutland
2020-12-04  0:54     ` [EXT] " Alex Belits
2020-12-07 11:58       ` Mark Rutland
2020-11-23 17:58 ` [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-11-23 17:58 ` [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-11-23 22:29   ` Frederic Weisbecker
2020-11-23 22:39     ` [EXT] " Alex Belits
2020-11-23 23:21       ` Frederic Weisbecker
2020-11-25  3:20         ` Alex Belits
2021-01-22 15:00         ` Marcelo Tosatti
2020-11-24 16:36 ` [PATCH v5 0/9] "Task_isolation" mode Tom Rix
2020-11-24 17:40   ` [EXT] " Alex Belits
2020-12-02 14:02     ` Mark Rutland
2020-12-04  0:39       ` Alex Belits
2020-12-05 20:40 ` Pavel Machek
2020-12-05 23:25   ` Thomas Gleixner
2020-12-11 18:08     ` Yury Norov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).