linux-api.vger.kernel.org archive mirror
* [PATCH 00/12] "Task_isolation" mode
@ 2020-03-04 16:01 Alex Belits
  2020-03-04 16:03 ` [PATCH 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
                   ` (12 more replies)
  0 siblings, 13 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:01 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

This is an update of the task isolation work that was originally done
by Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
November 2017. It is adapted to the current kernel and cleaned up to
make the functionality both more complete (it prevents isolation from
breaking in situations that were not covered before) and cleaner (it
avoids any dubious or fragile use of kernel interfaces, and provides a
clean and reliable isolation-breaking procedure).

I suppose I should explain why such a thing exists.

This is the result of development and maintenance of task isolation
functionality that originally started based on task isolation patch
v15 and was later updated to include v16. It provided an RTOS-like,
predictable environment for userspace tasks running on arm64
processors alongside a full-featured Linux environment. It is intended
to provide a reliable, interruption-free environment from the point
when a userspace task enters isolation until the moment it leaves
isolation or receives a signal intentionally sent to it, and it was
successfully used for this purpose. While CPU isolation with nohz
provides an environment that is close to this requirement, the
remaining IPIs and other disturbances keep it from being usable for
tasks that require complete predictability of CPU timing.

It is clear that such isolation is neither possible nor necessary
while a CPU is running kernel code, or userspace initialization or
cleanup, so there is a need for a separate isolated state that a
userspace task can enter and exit. This was the reason for using the
original task isolation patches, and that reason still exists now. The
alternative, running an RTOS instead of Linux, is becoming more and
more labor-intensive because modern CPUs and SoCs have very complex
device/resource configuration and management procedures; for some
hardware it is clearly impractical to maintain an RTOS with hardware
support on par with the Linux kernel that is also reliable and secure.

On the other hand, the development of modern embedded-oriented SoCs
has shown that numerous CPU cores may or may not share any hardware
resources, depending on the SoC designers' intentions. Therefore the
ability of the OS to switch a CPU core into an RTOS-like mode and
truly, at all levels, leave it alone until the OS is needed there
again is an important feature for modern embedded systems development;
probably even more important than real-time interrupt latency and
preemption, now that people who don't like how their interrupts are
handled can simply add CPU cores. This is why we had to maintain task
isolation, and I believe that after all the improvements in CPU
isolation, timer and interrupt management that have been made in Linux
since 2017, it is needed even more, not less.

This set of patches covers only the implementation of task isolation.
However, the need for additional functionality, such as selective TLB
flushes, is one of the reasons why task_isolation_on_cpu() avoids any
non-isolation-specific data structures, and why there is a
fast_task_isolation_cpu_cleanup() function that is always called on
the CPU where the isolated task is running.
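
To illustrate the intended usage of this state by other subsystems,
here is a minimal sketch of the pattern used by the clock_was_set()
change in patch 03 (my_ipi_func is only a placeholder for whatever
callback a subsystem would normally send to every CPU):

	/* Send an IPI everywhere except CPUs running isolated tasks. */
	struct cpumask mask;

	cpumask_clear(&mask);
	task_isolation_cpumask(&mask);		/* CPUs with isolated tasks */
	cpumask_complement(&mask, &mask);	/* everyone else */
	on_each_cpu_mask(&mask, my_ipi_func, NULL, 1);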

Reporting of task isolation breaking in the kernel log is now more
informative and, if necessary, can be adapted to provide meaningful
cause information to userspace software. I am not sure whether such a
mechanism is needed -- development and failure reporting in production
usually rely on kernel logs, and in production it is assumed that
isolation breaking should not happen on its own. On the other hand, if
an application can collect a meaningful log in which its own events
are matched to isolation failures, this may be better for developers
than matching the timing of records from multiple sources. For now,
only the kernel log shows detailed descriptions.

The userspace support code and test program are now at
https://github.com/abelits/libtmc . They were originally developed for
an earlier implementation, so they contain some checks that may be
redundant now but are kept for compatibility.

My thanks to Chris Metcalf for the design and maintenance of the
original task isolation patch, to Francis Giraldeau
<francis.giraldeau@gmail.com> and Yuri Norov <ynorov@marvell.com> for
various contributions to this work, and to Frederic Weisbecker
<frederic@kernel.org> for his work on CPU isolation and housekeeping
that made it possible to remove some less elegant solutions that I had
to devise for earlier (pre-4.17) kernels.

The previous patch (v16 by Chris Metcalf) is at:

https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetcalf@mellanox.com

-- 
Alex


* [PATCH 01/12] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
@ 2020-03-04 16:03 ` Alex Belits
  2020-03-04 16:04 ` [PATCH 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:03 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Chris Metcalf <cmetcalf@mellanox.com>

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..2bc5e85f2514 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -270,6 +270,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -372,6 +373,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..1fa0b2d04afa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1870,6 +1870,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1



* [PATCH 02/12] task_isolation: vmstat: add vmstat_idle function
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
  2020-03-04 16:03 ` [PATCH 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-03-04 16:04 ` Alex Belits
  2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:04 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Chris Metcalf <cmetcalf@mellanox.com>

This function checks that the vmstat worker is not running and that
the vmstat diffs don't require an update.  The function is called
from the task-isolation code to see whether we actually need to do
some work to quiet vmstat.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2bc5e85f2514..66d9ae32cf07 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -271,6 +271,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -374,6 +375,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fa0b2d04afa..5c4aec651062 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1879,6 +1879,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1



* [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
  2020-03-04 16:03 ` [PATCH 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
  2020-03-04 16:04 ` [PATCH 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-03-04 16:07 ` Alex Belits
  2020-03-05 18:33   ` Frederic Weisbecker
                     ` (2 more replies)
  2020-03-04 16:08 ` [PATCH 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:07 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flags, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.
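
For example (the CPU list below is only illustrative, not part of this
patch), a kernel built with CONFIG_TASK_ISOLATION=y could be booted
with:

	isolcpus=nohz,domain,2-7

so that CPUs 2-7 are excluded from scheduler load balancing, get
nohz_full behavior, and are available for isolated tasks.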

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any action that avoids a future
interrupt to the core, such as quiescing a worker thread that is
currently scheduled (e.g. the vmstat worker), or cleaning up now any
state that would otherwise require a future IPI to the core (e.g. the
mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
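
As a minimal sketch of the intended userspace usage (the fallback uapi
definitions and the retry policy below are illustrative only, not part
of this patch):

	#include <sys/prctl.h>
	#include <signal.h>
	#include <errno.h>

	#ifndef PR_TASK_ISOLATION
	#define PR_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
	#endif

	static int enter_isolation(void)
	{
		/* Deliver SIGUSR1 instead of SIGKILL if isolation is lost. */
		unsigned int flags = PR_TASK_ISOLATION_ENABLE |
				     PR_TASK_ISOLATION_SET_SIG(SIGUSR1);

		/* The task must already be affine to a single isolated CPU. */
		while (prctl(PR_TASK_ISOLATION, flags, 0, 0, 0) != 0) {
			if (errno == EAGAIN)
				continue;	/* quiescing incomplete, retry */
			return -1;		/* e.g. EINVAL: wrong CPU or affinity */
		}
		return 0;	/* isolated until the next kernel entry */
	}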

In a number of cases we can tell in advance that we are going to
interrupt a remote cpu, e.g. via an IPI or a TLB flush.  In that case
we generate the diagnostic (and optional stack dump) on behalf of the
remote core, to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, arm,
and arm64.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 229 ++++++
 include/linux/sched.h                         |   4 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  28 +
 kernel/Makefile                               |   2 +
 kernel/context_tracking.c                     |   2 +
 kernel/isolation.c                            | 774 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1111 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..e4a2d6e37645 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4808,6 +4808,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 15c8ac313678..e81252eb4f92 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..144766d3536a
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+extern int task_isolation_on_cpu(int cpu);
+extern void task_isolation_check_run_cleanup(void);
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_interrupt()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened than just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remote cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline void task_isolation_check_run_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..52fdb32aa3b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,6 +1280,10 @@ struct task_struct {
 	unsigned long			lowest_stack;
 	unsigned long			prev_lowest_stack;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags;  /* prctl */
+	unsigned short			task_isolation_state;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..f4848ed2a069 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 20a6ac33761c..ecdf567f6bd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -576,6 +576,34 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	Allow userspace processes that place themselves on cores with
+	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	to "isolate" themselves from the kernel.  Prior to returning to
+	userspace, isolated tasks will arrange that no future kernel
+	activity will interrupt the task while the task is running in
+	userspace.  Attempting to re-enter the kernel while in this mode
+	will cause the task to be terminated with a signal; you must
+	explicitly use prctl() to disable task isolation before resuming
+	normal use of the kernel.
+
+	This "hard" isolation from the kernel is required for userspace
+	tasks that are running hard real-time tasks in userspace, such as
+	a high-speed network driver in userspace.  Without this option, but
+	with NO_HZ_FULL enabled, the kernel will make a best-effort, "soft"
+	effort to shield a single userspace process from interrupts, but
+	makes no guarantees.
+
+	You should say "N" unless you are intending to run a
+	high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb4130ced32..2f2ae91f90d5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_stackleak.o := n
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0296b4bda8f1..e9206736f219 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
 			if (state == CONTEXT_USER) {
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..8e982ae1b19e
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,774 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * This variable contains thread flags copied at the moment
+ * when schedule() switched to the task on a given CPU,
+ * or 0 if no task is running.
+ */
+DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
+
+/*
+ * Counter for isolation state on a given CPU, increments when entering
+ * isolation and decrements when exiting isolation (before or after the
+ * cleanup). Multiple simultaneously running procedures entering or
+ * exiting isolation are prevented by checking the result of
+ * incrementing or decrementing this variable. This variable is both
+ * incremented and decremented by the CPU that caused isolation entry
+ * or exit.
+ *
+ * This is necessary because multiple isolation-breaking events may happen
+ * at once (or one as the result of the other), however isolation exit
+ * may only happen once to transition from isolated to non-isolated state.
+ * Therefore, if decrementing this counter results in a value less than 0,
+ * isolation exit procedure can't be started -- it already happened, or is
+ * in progress, or isolation is not entered yet.
+ */
+DEFINE_PER_CPU(atomic_t, isol_counter);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means, counter should be
+ * decremented back, and isolation exit that is already in progress, should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in the message
+			 * originated on remote CPU, and it did not
+			 * re-enter isolated state since then (warned
+			 * is true). Only local messages following
+			 * remote messages, likely about the same
+			 * isolation breaking event, are skipped to
+			 * avoid duplication. If remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, local cause is skipped
+			 * from messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure, index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	/* At this point breaking isolation from other CPUs is possible again */
+
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	/* Copy flags with task isolation disabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+		}
+		put_cpu();
+	} else {
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+			atomic_dec(&per_cpu(isol_exit_counter, cpu));
+			put_cpu();
+			return;
+		}
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where
+		 * isolated task is running
+		 * isol_exit_counter will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		/* Increment counter, this will allow isolation breaking */
+		if (atomic_inc_return(&per_cpu(isol_counter,
+					      smp_processor_id())) > 1) {
+			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+		}
+		atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+
+	/* Copy flags with task isolation enabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+
+	current->task_isolation_state = STATE_ISOLATED;
+	return;
+
+error:
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ *
+ * Messages will be suppressed if there is already a reported remote
+ * cause for isolation breaking, so we don't generate multiple
+ * confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/* Are we exiting isolation already? */
+	if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
+		task->task_isolation_state = STATE_NORMAL;
+		return;
+	}
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		pr_task_isol_err(smp_processor_id(),
+				 "isolation: in NMI; not delivering signal\n");
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (pr_task_isol_warn_supp(smp_processor_id(),
+				   "task_isolation lost due to %s\n", buf))
+		debug_dump_stack();
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void task_isolation_remote(int cpu, const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	va_list args;
+	char buf[200];
+
+	if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
+		return;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	if (pr_task_isol_warn(cpu,
+			      "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+			      buf,
+			      curr_task->comm, curr_task->tgid,
+			      curr_task->pid, smp_processor_id()))
+		debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void task_isolation_remote_cpumask(const struct cpumask *mask,
+				   const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu, first_cpu;
+
+	if (task_isolation_map == NULL ||
+		!zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	first_cpu = -1;
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (task_isolation_on_cpu(cpu)) {
+			if (first_cpu < 0)
+				first_cpu = cpu;
+			else
+				cpumask_set_cpu(cpu, warn_mask);
+		}
+	}
+
+	if (first_cpu < 0)
+		goto done;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (cpumask_weight(warn_mask) == 0)
+		pr_task_isol_warn(first_cpu,
+				  "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	else
+		pr_task_isol_warn(first_cpu,
+				  " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  cpumask_pr_args(warn_mask),
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
+
+/*
+ * Check if given CPU is running isolated task.
+ */
+int task_isolation_on_cpu(int cpu)
+{
+	return test_bit(TIF_TASK_ISOLATION,
+			&per_cpu(tsk_thread_flags_cache, cpu));
+}
+
+/*
+ * Set CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
+
+/*
+ * Check if cleanup is scheduled on the current CPU, and if so, run it.
+ * Intended to be called from notify_resume() or another such callback
+ * on the target CPU.
+ */
+void task_isolation_check_run_cleanup(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 5b2396350dd1..1df57e38c361 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index f9bc5c303e3f..0a4059a8c4f9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2513,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3a609e7344f3..5bb98f39bde6 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a792d21cac64..1d4dec9d3ee7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -882,6 +882,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 04/12] task_isolation: Add task isolation hooks to arch-independent code
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (2 preceding siblings ...)
  2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-03-04 16:08 ` Alex Belits
  2020-03-04 16:09 ` [PATCH 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:08 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Chris Metcalf <cmetcalf@mellanox.com>

This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
  local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can be
placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now, we
just update either callers or callees as makes most sense.
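
As a rough sketch (with a hypothetical arch_smp_send_reschedule(), which
is not part of this series), the intermediate layer mentioned above would
look something like this:

	/* kernel/smp.c: generic wrapper, illustration only */
	void smp_send_reschedule(int cpu)
	{
		task_isolation_remote(cpu, "reschedule IPI");
		arch_smp_send_reschedule(cpu);
	}

Each architecture would then only provide arch_smp_send_reschedule(), and
the isolation hook would live in a single place instead of being repeated
in every architecture's implementation.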

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/irq/irqdesc.c | 9 +++++++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 98a5f10d1900..e2b81d035fa1 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -670,6 +671,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 		irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
@@ -711,6 +716,10 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 
 	irq = irq_find_mapping(domain, hwirq);
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "NMI irq %d (%s)" : "NMI irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * ack_bad_irq is not NMI-safe, just report
 	 * an invalid interrupt.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 828cc30774bc..8fd4ece43dd8 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,6 +18,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -102,8 +103,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+			task_isolation_remote(cpu, "irq_work");
 			arch_send_call_function_single_ipi(cpu);
+		}
 	} else {
 		__irq_work_queue_local(work);
 	}
diff --git a/kernel/smp.c b/kernel/smp.c
index d0ada39eb4d4..3a8bcbdd4ce6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -176,8 +177,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -466,6 +469,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 05/12] task_isolation: arch/x86: enable task isolation functionality
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (3 preceding siblings ...)
  2020-03-04 16:08 ` [PATCH 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-03-04 16:09 ` Alex Belits
  2020-03-04 16:10 ` [PATCH 06/12] task_isolation: arch/arm64: " Alex Belits
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:09 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Chris Metcalf <cmetcalf@mellanox.com>

In prepare_exit_to_usermode(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.

In syscall_trace_enter(), add the necessary support for
reporting syscalls for task-isolation processes.

Add task_isolation_interrupt() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 13 +++++++++++++
 arch/x86/include/asm/apic.h        |  3 +++
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/apic/ipi.c         |  2 ++
 arch/x86/mm/fault.c                |  4 ++++
 6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..9ea6d3e6e77d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9747876980b5..191327f4f802 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -26,6 +26,7 @@
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -86,6 +87,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -204,6 +214,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 19e94af9cc5d..71149abbb0a0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_APIC_H
 
 #include <linux/cpumask.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
@@ -524,6 +525,7 @@ extern void irq_exit(void);
 
 static inline void entering_irq(void)
 {
+	task_isolation_interrupt("irq");
 	irq_enter();
 	kvm_set_cpu_l1tf_flush_l1d();
 }
@@ -536,6 +538,7 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
+	task_isolation_interrupt("ack irq");
 	irq_enter();
 	ack_APIC_irq();
 	kvm_set_cpu_l1tf_flush_l1d();
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cf4327986e98..60d107f784ee 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -92,6 +92,7 @@ struct thread_info {
 #define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
+#define TIF_TASK_ISOLATION	18	/* task isolation enabled for task */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_NOCPUID		(1 << TIF_NOCPUID)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
@@ -140,7 +142,7 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 6ca0f91372fd..b4dfaad6a440 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -2,6 +2,7 @@
 
 #include <linux/cpumask.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 
 #include "local.h"
 
@@ -67,6 +68,7 @@ void native_smp_send_reschedule(int cpu)
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
+	task_isolation_remote(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa4ea09593ab..2175a8655a7d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/isolation.h>		/* task_isolation_interrupt     */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1483,6 +1484,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 06/12] task_isolation: arch/arm64: enable task isolation functionality
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (4 preceding siblings ...)
  2020-03-04 16:09 ` [PATCH 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
@ 2020-03-04 16:10 ` Alex Belits
  2020-03-04 16:31   ` Mark Rutland
  2020-03-04 16:11 ` [PATCH 07/12] task_isolation: arch/arm: " Alex Belits
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:10 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Chris Metcalf <cmetcalf@mellanox.com>

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one. This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_faults().

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 10 ++++++++++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  7 +++++++
 arch/arm64/mm/fault.c                |  5 +++++
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0b30e884e088..93b6aabc8be9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -129,6 +129,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f0cec4160136..7563098eb5b2 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
@@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index cd6e5fa48b9c..b35b9b0c594c 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return -1;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 339882db5a91..d488c91a4877 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+	_TIF_UPROBE | _TIF_FSCHECK)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_check_run_cleanup();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d4ed9a19d8fe..00f0f77adea0 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -32,6 +32,7 @@
 #include <linux/irq_work.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 #endif
@@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
 	}
 
+	task_isolation_interrupt("IPI type %d (%s)", ipinr,
+				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
 	switch (ipinr) {
 	case IPI_RESCHEDULE:
 		scheduler_ipi();
@@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 85566d32958f..fc4b42c81c4f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
+
 		/*
 		 * Major/minor page fault accounting is only done
 		 * once. If we go through a retry, it is extremely
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 07/12] task_isolation: arch/arm: enable task isolation functionality
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (5 preceding siblings ...)
  2020-03-04 16:10 ` [PATCH 06/12] task_isolation: arch/arm64: " Alex Belits
@ 2020-03-04 16:11 ` Alex Belits
  2020-03-04 16:12 ` [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:11 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Francis Giraldeau <francis.giraldeau@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. Task isolation needs an additional thread flag, which
requires changing the entry assembly code to accept a bitfield larger than
one byte. The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
loaded from the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++++++---
 arch/arm/kernel/entry-common.S     | 15 ++++++++++-----
 arch/arm/kernel/signal.c           | 10 +++++++++-
 arch/arm/kernel/smp.c              |  4 ++++
 arch/arm/mm/fault.c                |  8 +++++++-
 6 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 97864aabc2a6..1a66e6c6807c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 0d0d5178e2c3..ec3c2084c391 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation active */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
@@ -153,18 +154,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 271cb8a1eba1..6ceb5cb808a9 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending
 
 
@@ -90,7 +91,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
  UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)
@@ -98,7 +100,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif
 
-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
@@ -131,7 +134,8 @@ ENTRY(ret_to_user_from_irq)
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]
-	tst	r1, #_TIF_WORK_MASK
+	ldr	r2, =_TIF_WORK_MASK
+	tst	r1, r2
 	bne	slow_work_pending
 no_work_pending:
 	asm_trace_hardirqs_on save = 0
@@ -251,7 +255,8 @@ local_restart:
 	ldr	r10, [tsk, #TI_FLAGS]		@ check for syscall tracing
 	stmdb	sp!, {r4, r5}			@ push fifth and sixth args
 
-	tst	r10, #_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	ldr	r11, =_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	tst	r10, r11
 	bne	__sys_trace
 
 	invoke_syscall tbl, scno, r10, __ret_fast_syscall
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ab2568996ddb..f775cbcb7487 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -12,6 +12,7 @@
 #include <linux/tracehook.h>
 #include <linux/uprobes.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/elf.h>
 #include <asm/cacheflush.h>
@@ -639,6 +640,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
 	return 0;
 }
 
+#define WORK_PENDING_LOOP_FLAGS	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |	\
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
 asmlinkage int
 do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 {
@@ -676,7 +680,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 		}
 		local_irq_disable();
 		thread_flags = current_thread_info()->flags;
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	return 0;
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 46e1be9e57a8..95f19b980776 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -26,6 +26,7 @@
 #include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <linux/atomic.h>
 #include <asm/bugs.h>
@@ -560,6 +561,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 
@@ -579,6 +581,7 @@ void arch_irq_work_raise(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
@@ -702,6 +705,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bd0f4821f7e1..acd11a69c4e4 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/system_misc.h>
@@ -332,8 +333,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (6 preceding siblings ...)
  2020-03-04 16:11 ` [PATCH 07/12] task_isolation: arch/arm: " Alex Belits
@ 2020-03-04 16:12 ` Alex Belits
  2020-03-06 16:03   ` Frederic Weisbecker
  2020-03-04 16:13 ` [PATCH 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:12 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds a check for it.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1d4dec9d3ee7..fe4503ba1316 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -262,7 +263,7 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (7 preceding siblings ...)
  2020-03-04 16:12 ` [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-03-04 16:13 ` Alex Belits
  2020-03-04 16:14 ` [PATCH 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:13 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Yuri Norov <ynorov@marvell.com>

If a CPU runs an isolated task, there is no backlog on it, so we
don't need to flush it. Currently flush_all_backlogs() enqueues the
corresponding work on all CPUs, including the ones that run isolated
tasks. That breaks task isolation for nothing.

In this patch, backlog flushing is enqueued only on non-isolated CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 net/core/dev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c6c985fe7b1b..6d32abb1f06d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -74,6 +74,7 @@
 #include <linux/cpu.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -5518,9 +5519,12 @@ static void flush_all_backlogs(void)
 
 	get_online_cpus();
 
-	for_each_online_cpu(cpu)
+	for_each_online_cpu(cpu) {
+		if (task_isolation_on_cpu(cpu))
+			continue;
 		queue_work_on(cpu, system_highpri_wq,
 			      per_cpu_ptr(&flush_works, cpu));
+	}
 
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(&flush_works, cpu));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (8 preceding siblings ...)
  2020-03-04 16:13 ` [PATCH 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
@ 2020-03-04 16:14 ` Alex Belits
  2020-03-04 16:15 ` [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:14 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race condition between this update and isolation breaking is avoided
at the cost of disabling per-CPU buffer writing for the duration of the
update when it coincides with isolation breaking.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 61 ++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 61f0e92ace99..6c7479b6411d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1701,6 +1702,37 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure the update is done and the isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -1784,13 +1816,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						&cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+					&cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -1829,13 +1870,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
-					 &cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+			if (!update_if_isolated(cpu_buffer, cpu_id)) {
+				schedule_work_on(cpu_id,
+						 &cpu_buffer->update_pages_work);
+				wait_for_completion(&cpu_buffer->update_done);
+			}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (9 preceding siblings ...)
  2020-03-04 16:14 ` [PATCH 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-03-04 16:15 ` Alex Belits
  2020-03-06 15:34   ` Frederic Weisbecker
  2020-03-04 16:16 ` [PATCH 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:15 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not send IPIs to CPUs that are
running isolated tasks.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 3a8bcbdd4ce6..d9b4b2fedfed 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -731,9 +731,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (10 preceding siblings ...)
  2020-03-04 16:15 ` [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-03-04 16:16 ` Alex Belits
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-04 16:16 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	linux-mm, linux-arch

There are various mechanisms that select CPUs for jobs other than
regular workqueue selection. CPU isolation normally does not
prevent those jobs from running on isolated CPUs. When task
isolation is enabled, those jobs should be limited to housekeeping
CPUs.
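
The pattern applied below is roughly the following (a minimal sketch with
a hypothetical helper name, shown only to summarize the approach):

	#include <linux/sched/isolation.h>
	#include <linux/topology.h>

	/* Pick a CPU near "node" for background work, avoiding isolated CPUs. */
	static unsigned int example_pick_housekeeping_cpu(int node)
	{
		const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN);

		return cpumask_any_and(cpumask_of_node(node), hk);
	}

If the intersection is empty, callers either fall back to an error (as in
the rps_map change below) or end up with nr_cpu_ids from cpumask_any_and().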

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/pci/pci-driver.c |  9 +++++++
 lib/cpumask.c            | 53 +++++++++++++++++++++++++---------------
 net/core/net-sysfs.c     |  9 +++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0454ca0e4e3f..cb872cdd1782 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -332,6 +333,9 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+#endif
 	int error, node, cpu;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
@@ -353,7 +357,12 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
+#ifdef CONFIG_TASK_ISOLATION
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
+#else
 		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+#endif
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..dcbc30a47600 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,28 +206,40 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	const struct cpumask *mask;
+	int cpu, m, n;
+
+#ifdef CONFIG_TASK_ISOLATION
+	mask = housekeeping_cpumask(HK_FLAG_DOMAIN);
+	m = cpumask_weight(mask);
+#else
+	mask = cpu_online_mask;
+	m = num_online_cpus();
+#endif
 
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
-
-	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-	} else {
-		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-
-		for_each_cpu(cpu, cpu_online_mask) {
-			/* Skip NUMA nodes, done above. */
-			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-				continue;
-
-			if (i-- == 0)
-				return cpu;
+	n = i % m;
+
+	while (m-- > 0) {
+		if (node == NUMA_NO_NODE) {
+			for_each_cpu(cpu, mask)
+				if (n-- == 0)
+					return cpu;
+		} else {
+			/* NUMA first. */
+			for_each_cpu_and(cpu, cpumask_of_node(node), mask)
+				if (n-- == 0)
+					return cpu;
+
+			for_each_cpu(cpu, mask) {
+				/* Skip NUMA nodes, done above. */
+				if (cpumask_test_cpu(cpu,
+						     cpumask_of_node(node)))
+					continue;
+
+				if (n-- == 0)
+					return cpu;
+			}
 		}
 	}
 	BUG();
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4c826b8bf9b1..253758f102d9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
@@ -725,6 +726,14 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_and(mask, mask, housekeeping_cpumask(HK_FLAG_DOMAIN));
+	if (cpumask_weight(mask) == 0) {
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+#endif
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 06/12] task_isolation: arch/arm64: enable task isolation functionality
  2020-03-04 16:10 ` [PATCH 06/12] task_isolation: arch/arm64: " Alex Belits
@ 2020-03-04 16:31   ` Mark Rutland
  2020-03-08  4:48     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 71+ messages in thread
From: Mark Rutland @ 2020-03-04 16:31 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, linux-mm, linux-arch, will, catalin.marinas

Hi Alex,

For patches affecting arm64, please CC LAKML and the arm64 maintainers
(Will and Catalin). I've Cc'd the maintainers here.

On Wed, Mar 04, 2020 at 04:10:28PM +0000, Alex Belits wrote:
> From: Chris Metcalf <cmetcalf@mellanox.com>
> 
> In do_notify_resume(), call task_isolation_start() for
> TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
> and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
> since we don't clear _TIF_TASK_ISOLATION in the loop.
> 
> We tweak syscall_trace_enter() slightly to carry the "flags"
> value from current_thread_info()->flags for each of the tests,
> rather than doing a volatile read from memory for each one. This
> avoids a small overhead for each test, and in particular avoids
> that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

Stale commit message?

Looking at the patch below, this doesn't seem to be the case; it just
calls test_thread_flag(TIF_TASK_ISOLATION).

> 
> We instrument the smp_send_reschedule() routine so that it checks for
> isolated tasks and generates a suitable warning if needed.
> 
> Finally, report on page faults in task-isolation processes in
> do_page_faults().
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>

The From line says this was from Chris Metcalf, but he's missing from
the Sign-off chain here, which isn't right.

> ---
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/thread_info.h |  5 ++++-
>  arch/arm64/kernel/ptrace.c           | 10 ++++++++++
>  arch/arm64/kernel/signal.c           | 13 ++++++++++++-
>  arch/arm64/kernel/smp.c              |  7 +++++++
>  arch/arm64/mm/fault.c                |  5 +++++
>  6 files changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0b30e884e088..93b6aabc8be9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -129,6 +129,7 @@ config ARM64
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
>  	select HAVE_ARCH_STACKLEAK
> +	select HAVE_ARCH_TASK_ISOLATION
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index f0cec4160136..7563098eb5b2 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
>  #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
>  #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
> +#define TIF_TASK_ISOLATION	6
>  #define TIF_NOHZ		7
>  #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
>  #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
> @@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
> +#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
>  #define _TIF_NOHZ		(1 << TIF_NOHZ)
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
> @@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
>  
>  #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
>  				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> -				 _TIF_UPROBE | _TIF_FSCHECK)
> +				 _TIF_UPROBE | _TIF_FSCHECK | \
> +				 _TIF_TASK_ISOLATION)
>  
>  #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
>  				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index cd6e5fa48b9c..b35b9b0c594c 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -29,6 +29,7 @@
>  #include <linux/regset.h>
>  #include <linux/tracehook.h>
>  #include <linux/elf.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/compat.h>
>  #include <asm/cpufeature.h>
> @@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
>  			return -1;
>  	}
>  
> +	/*
> +	 * In task isolation mode, we may prevent the syscall from
> +	 * running, and if so we also deliver a signal to the process.
> +	 */
> +	if (test_thread_flag(TIF_TASK_ISOLATION)) {
> +		if (task_isolation_syscall(regs->syscallno) == -1)
> +			return -1;
> +	}

As above, this doesn't match the commit message.

AFAICT, task_isolation_syscall() always returns either 0 or -1, which
isn't great as an API. I see secure_computing() seems to do the same,
and it'd be nice to clean that up to either be a real error code or a
boolean.

> +
>  	/* Do the secure computing after ptrace; failures should be fast. */
>  	if (secure_computing() == -1)
>  		return -1;
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 339882db5a91..d488c91a4877 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -20,6 +20,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/ratelimit.h>
>  #include <linux/syscalls.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/daifflags.h>
>  #include <asm/debug-monitors.h>
> @@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> +#define NOTIFY_RESUME_LOOP_FLAGS \
> +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> +	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> +	_TIF_UPROBE | _TIF_FSCHECK)
> +
>  asmlinkage void do_notify_resume(struct pt_regs *regs,
>  				 unsigned long thread_flags)
>  {
> @@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  	 */
>  	trace_hardirqs_off();
>  
> +	task_isolation_check_run_cleanup();
> +
>  	do {
>  		/* Check valid user FS if needed */
>  		addr_limit_user_check();
> @@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  
>  		local_daif_mask();
>  		thread_flags = READ_ONCE(current_thread_info()->flags);
> -	} while (thread_flags & _TIF_WORK_MASK);
> +	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
> +
> +	if (thread_flags & _TIF_TASK_ISOLATION)
> +		task_isolation_start();
>  }
>  
>  unsigned long __ro_after_init signal_minsigstksz;
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index d4ed9a19d8fe..00f0f77adea0 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -32,6 +32,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/kexec.h>
>  #include <linux/kvm_host.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/alternative.h>
>  #include <asm/atomic.h>
> @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
>  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "wakeup IPI");
>  	smp_cross_call(mask, IPI_WAKEUP);
>  }
>  #endif
> @@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
>  	}
>  
> +	task_isolation_interrupt("IPI type %d (%s)", ipinr,
> +				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
> +

Are these tracing hooks?

Surely they aren't necessary for functional correctness?

>  	switch (ipinr) {
>  	case IPI_RESCHEDULE:
>  		scheduler_ipi();
> @@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  
>  void smp_send_reschedule(int cpu)
>  {
> +	task_isolation_remote(cpu, "reschedule IPI");
>  	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
>  }
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>  void tick_broadcast(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "timer IPI");
>  	smp_cross_call(mask, IPI_TIMER);
>  }
>  #endif
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 85566d32958f..fc4b42c81c4f 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -23,6 +23,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/preempt.h>
>  #include <linux/hugetlb.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/acpi.h>
>  #include <asm/bug.h>
> @@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  	 */
>  	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
>  			      VM_FAULT_BADACCESS)))) {
> +		/* No signal was generated, but notify task-isolation tasks. */
> +		if (user_mode(regs))
> +			task_isolation_interrupt("page fault at %#lx", addr);
> +

We check user_mode(regs) much earlier in this function to set
FAULT_FLAG_USER. Is there some reason this cannot live there?

Also, this seems to be a tracing hook -- is this necessary?

Thanks,
Mark.

>  		/*
>  		 * Major/minor page fault accounting is only done
>  		 * once. If we go through a retry, it is extremely
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-03-05 18:33   ` Frederic Weisbecker
  2020-03-08  5:32     ` [EXT] " Alex Belits
  2020-04-28 14:12     ` Marcelo Tosatti
  2020-03-06 15:26   ` Frederic Weisbecker
  2020-03-06 16:00   ` Frederic Weisbecker
  2 siblings, 2 replies; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-05 18:33 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, mingo, peterz, linux-kernel, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
> 
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
> 
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting. Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.

Hi Alex,

I'm glad this patchset is being resurrected.
Reading that changelog, I like the general idea and the direction.
The diff is a bit scary, though; I'll check the patches in detail
in the upcoming days.

> 
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.

I'm wondering if it's wise to run that on a guest at all :-)
Or we would have to consider any guest exit to the host as a
disturbance; we would then need some sort of paravirt
driver to notify that, etc. That doesn't sound appealing.

Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
  2020-03-05 18:33   ` Frederic Weisbecker
@ 2020-03-06 15:26   ` Frederic Weisbecker
  2020-03-08  6:06     ` [EXT] " Alex Belits
  2020-03-06 16:00   ` Frederic Weisbecker
  2 siblings, 1 reply; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-06 15:26 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, mingo, peterz, linux-kernel, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> +
> +/*
> + * Print message prefixed with the description of the current (or
> + * last) isolated task on a given CPU. Intended for isolation breaking
> + * messages that include target task for the user's convenience.
> + *
> + * Messages produced with this function may have obsolete task
> + * information if isolated tasks managed to exit, start and enter
> + * isolation multiple times, or multiple tasks tried to enter
> + * isolation on the same CPU at once. For those unusual cases it would
> + * contain a valid description of the cause for isolation breaking and
> + * target CPU number, just not the correct description of which task
> + * ended up losing isolation.
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
> +{
> +	struct isol_task_desc *desc;
> +	struct task_struct *task;
> +	va_list args;
> +	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
> +	char buf[200];
> +	int curr_cpu, ind_counter, ind_counter_old, ind;
> +
> +	curr_cpu = get_cpu();
> +	desc = &per_cpu(isol_task_descs, cpu);
> +	ind_counter = atomic_read(&desc->curr_index);
> +
> +	if (curr_cpu == cpu) {
> +		/*
> +		 * Message is for the current CPU so current
> +		 * task_struct should be used instead of cached
> +		 * information.
> +		 *
> +		 * Like in other diagnostic messages, if issued from
> +		 * interrupt context, current will be the interrupted
> +		 * task. Unlike other diagnostic messages, this is
> +		 * always relevant because the message is about
> +		 * interrupting a task.
> +		 */
> +		ind = ind_counter & 1;
> +		if (supp && desc->warned[ind]) {
> +			/*
> +			 * If supp is true, skip the message if the
> +			 * same task was mentioned in the message
> +			 * originated on remote CPU, and it did not
> +			 * re-enter isolated state since then (warned
> +			 * is true). Only local messages following
> +			 * remote messages, likely about the same
> +			 * isolation breaking event, are skipped to
> +			 * avoid duplication. If remote cause is
> +			 * immediately followed by a local one before
> +			 * isolation is broken, local cause is skipped
> +			 * from messages.
> +			 */
> +			put_cpu();
> +			return 0;
> +		}
> +		task = current;
> +		snprintf(buf_prefix, sizeof(buf_prefix),
> +			 "isolation %s/%d/%d (cpu %d)",
> +			 task->comm, task->tgid, task->pid, cpu);
> +		put_cpu();
> +	} else {
> +		/*
> +		 * Message is for remote CPU, use cached information.
> +		 */
> +		put_cpu();
> +		/*
> +		 * Make sure, index remained unchanged while data was
> +		 * copied. If it changed, data that was copied may be
> +		 * inconsistent because two updates in a sequence could
> +		 * overwrite the data while it was being read.
> +		 */
> +		do {
> +			/* Make sure we are reading up to date values */
> +			smp_mb();
> +			ind = ind_counter & 1;
> +			snprintf(buf_prefix, sizeof(buf_prefix),
> +				 "isolation %s/%d/%d (cpu %d)",
> +				 desc->comm[ind], desc->tgid[ind],
> +				 desc->pid[ind], cpu);
> +			desc->warned[ind] = true;
> +			ind_counter_old = ind_counter;
> +			/* Record the warned flag, then re-read descriptor */
> +			smp_mb();
> +			ind_counter = atomic_read(&desc->curr_index);
> +			/*
> +			 * If the counter changed, something was updated, so
> +			 * repeat everything to get the current data
> +			 */
> +		} while (ind_counter != ind_counter_old);
> +	}

So the need to log the fact that we are sending an event to a remote CPU
that *may be* running an isolated task makes things very complicated and
even racy.

How bad would it be to only log those interruptions once they land on the target?

Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-03-04 16:15 ` [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-03-06 15:34   ` Frederic Weisbecker
  2020-03-08  6:48     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-06 15:34 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, mingo, peterz, linux-kernel, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Wed, Mar 04, 2020 at 04:15:24PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> Make sure that kick_all_cpus_sync() does not call CPUs that are running
> isolated tasks.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/smp.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 3a8bcbdd4ce6..d9b4b2fedfed 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -731,9 +731,21 @@ static void do_nothing(void *unused)
>   */
>  void kick_all_cpus_sync(void)
>  {
> +	struct cpumask mask;
> +
>  	/* Make sure the change is visible before we kick the cpus */
>  	smp_mb();
> -	smp_call_function(do_nothing, NULL, 1);
> +
> +	preempt_disable();
> +#ifdef CONFIG_TASK_ISOLATION
> +	cpumask_clear(&mask);
> +	task_isolation_cpumask(&mask);
> +	cpumask_complement(&mask, &mask);
> +#else
> +	cpumask_setall(&mask);
> +#endif
> +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> +	preempt_enable();
>  }

That looks very dangerous, the callers of kick_all_cpus_sync() want to
sync all CPUs for a reason. You will rather need to fix the callers.

Thanks.

>  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
  2020-03-05 18:33   ` Frederic Weisbecker
  2020-03-06 15:26   ` Frederic Weisbecker
@ 2020-03-06 16:00   ` Frederic Weisbecker
  2020-03-08  7:16     ` [EXT] " Alex Belits
  2 siblings, 1 reply; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-06 16:00 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, mingo, peterz, linux-kernel, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> +#ifdef CONFIG_TASK_ISOLATION
> +int try_stop_full_tick(void)
> +{
> +	int cpu = smp_processor_id();
> +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +	/* For an unstable clock, we should return a permanent error code. */
> +	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
> +		return -EINVAL;
> +
> +	if (!can_stop_full_tick(cpu, ts))
> +		return -EAGAIN;

Note that the stop_tick naming in nohz can be misleading. It means
we actually leave periodic mode and enter dynamic tick mode.

In practice it means that the tick is delayed until the next event, which
in the worst case may well be in 1 ms and in the best case never. So what
you probably want to check instead is whether the tick has been entirely
stopped (ie: we called hrtimer_cancel(&ts->sched_timer)).
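
For instance, a minimal sketch of such a check -- assuming the highres
case, where ts->sched_timer is the tick hrtimer that hrtimer_cancel()
would have removed -- might look like:

	tick_nohz_stop_sched_tick(ts, cpu);
	/*
	 * Entering dynticks mode only delays the tick until the next
	 * event; only report success if the tick hrtimer is really gone.
	 */
	if (hrtimer_active(&ts->sched_timer))
		return -EAGAIN;
	return 0;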

Thanks.

> +
> +	tick_nohz_stop_sched_tick(ts, cpu);
> +	return 0;
> +}
> +#endif
> +
>  static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>  {
>  	/*
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-03-04 16:12 ` [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-03-06 16:03   ` Frederic Weisbecker
  2020-03-08  7:28     ` [EXT] " Alex Belits
  0 siblings, 1 reply; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-06 16:03 UTC (permalink / raw)
  To: Alex Belits
  Cc: rostedt, mingo, peterz, linux-kernel, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Wed, Mar 04, 2020 at 04:12:40PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds a check for it.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/time/tick-sched.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 1d4dec9d3ee7..fe4503ba1316 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/stat.h>
>  #include <linux/sched/nohz.h>
> +#include <linux/isolation.h>
>  #include <linux/module.h>
>  #include <linux/irq_work.h>
>  #include <linux/posix-timers.h>
> @@ -262,7 +263,7 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

I fear you can't do that. A nohz full CPU is kicked for a reason.
As for the other cases, you need to fix the callers.

In the general case, randomly ignoring an interrupt is a correctness
issue.

Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 00/12] "Task_isolation" mode
  2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
                   ` (11 preceding siblings ...)
  2020-03-04 16:16 ` [PATCH 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
@ 2020-03-08  3:42 ` Alex Belits
  2020-03-08  3:44   ` [PATCH v2 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
                     ` (12 more replies)
  12 siblings, 13 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:42 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

This is the updated version of the task isolation patchset.

1. Commit messages updated to match changes.
2. Sign-off lines restored from original patches, changes listed wherever applicable.
3. arm platform -- added missing calls to syscall check and cleanup procedure after leaving isolation.
4. x86 platform -- added missing calls to cleanup procedure after leaving isolation.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 01/12] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
@ 2020-03-08  3:44   ` Alex Belits
  2020-03-08  3:46   ` [PATCH v2 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:44 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Chris Metcalf <cmetcalf@mellanox.com>

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..2bc5e85f2514 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -270,6 +270,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -372,6 +373,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..1fa0b2d04afa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1870,6 +1870,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 02/12] task_isolation: vmstat: add vmstat_idle function
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
  2020-03-08  3:44   ` [PATCH v2 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-03-08  3:46   ` Alex Belits
  2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:46 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Chris Metcalf <cmetcalf@mellanox.com>

This function checks that the vmstat worker is not running on the
current core and that the vmstat diffs don't require an update.  The
function is called from the task-isolation code to decide whether we
need to actually do some work to quiet vmstat.
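
As a rough sketch (the actual wiring is in the task-isolation patch
later in this series), the two vmstat helpers added so far are
expected to be used along these lines on the isolation entry path:

	/* Process context, interrupts enabled: may cancel and flush work. */
	quiet_vmstat_sync();

	/* Later, with interrupts disabled, just before returning to user: */
	if (!vmstat_idle())
		return -EAGAIN;	/* vmstat work got re-armed; userspace retries */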

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2bc5e85f2514..66d9ae32cf07 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -271,6 +271,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -374,6 +375,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fa0b2d04afa..5c4aec651062 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1879,6 +1879,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
  2020-03-08  3:44   ` [PATCH v2 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
  2020-03-08  3:46   ` [PATCH v2 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-03-08  3:47   ` Alex Belits
       [not found]     ` <20200307214254.7a8f6c22@hermes.lan>
                       ` (3 more replies)
  2020-03-08  3:48   ` [PATCH v2 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
                     ` (9 subsequent siblings)
  12 siblings, 4 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:47 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).
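
As a rough userspace sketch of this request/retry flow (it assumes the
PR_TASK_ISOLATION constants introduced by this patch and leaves out
the affinity setup to a single nohz_full CPU):

	#include <errno.h>
	#include <sys/prctl.h>

	#ifndef PR_TASK_ISOLATION
	#define PR_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#endif

	/* Retry while the kernel reports a transient quiescing failure. */
	static int enable_task_isolation(void)
	{
		int ret;

		do {
			ret = prctl(PR_TASK_ISOLATION,
				    PR_TASK_ISOLATION_ENABLE, 0, 0, 0);
		} while (ret != 0 && errno == EAGAIN);

		return ret;
	}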

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
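
A sketch of that variant (the signal number and the logging are just
placeholders; only async-signal-safe calls belong in the handler):

	#include <signal.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	#ifndef PR_TASK_ISOLATION
	#define PR_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
	#endif

	static void isolation_lost(int sig)
	{
		static const char msg[] = "task isolation lost\n";

		write(STDERR_FILENO, msg, sizeof(msg) - 1);
		_exit(1);
	}

	static int enable_task_isolation_notified(void)
	{
		signal(SIGUSR1, isolation_lost);
		return prctl(PR_TASK_ISOLATION,
			     PR_TASK_ISOLATION_ENABLE |
			     PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
			     0, 0, 0);
	}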

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, arm,
and arm64.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 229 ++++++
 include/linux/sched.h                         |   4 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  28 +
 kernel/Makefile                               |   2 +
 kernel/context_tracking.c                     |   2 +
 kernel/isolation.c                            | 774 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1111 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..e4a2d6e37645 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4808,6 +4808,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 15c8ac313678..e81252eb4f92 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..6bd71c67f10f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+DECLARE_PER_CPU(unsigned long, tsk_thread_flags_cache);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+extern int task_isolation_on_cpu(int cpu);
+extern void task_isolation_check_run_cleanup(void);
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_interrupt()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened than just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remote cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return false; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline void task_isolation_check_run_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..52fdb32aa3b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,6 +1280,10 @@ struct task_struct {
 	unsigned long			lowest_stack;
 	unsigned long			prev_lowest_stack;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags;  /* prctl */
+	unsigned short			task_isolation_state;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..f4848ed2a069 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 20a6ac33761c..ecdf567f6bd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -576,6 +576,34 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	Allow userspace processes that place themselves on cores with
+	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	to "isolate" themselves from the kernel.  Prior to returning to
+	userspace, isolated tasks will arrange that no future kernel
+	activity will interrupt the task while the task is running in
+	userspace.  Attempting to re-enter the kernel while in this mode
+	will cause the task to be terminated with a signal; you must
+	explicitly use prctl() to disable task isolation before resuming
+	normal use of the kernel.
+
+	This "hard" isolation from the kernel is required for userspace
+	tasks that are running hard real-time tasks in userspace, such as
+	a high-speed network driver in userspace.  Without this option, but
+	with NO_HZ_FULL enabled, the kernel will make a best-effort, "soft"
+	attempt to shield a single userspace process from interrupts, but
+	makes no guarantees.
+
+	You should say "N" unless you are intending to run a
+	high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb4130ced32..2f2ae91f90d5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_stackleak.o := n
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0296b4bda8f1..e9206736f219 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
 			if (state == CONTEXT_USER) {
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..ae29732c376c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,774 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * This variable contains thread flags copied at the moment
+ * when schedule() switched to the task on a given CPU,
+ * or 0 if no task is running.
+ */
+DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
+
+/*
+ * Counter for isolation state on a given CPU, increments when entering
+ * isolation and decrements when exiting isolation (before or after the
+ * cleanup). Multiple simultaneously running procedures entering or
+ * exiting isolation are prevented by checking the result of
+ * incrementing or decrementing this variable. This variable is both
+ * incremented and decremented by CPU that caused isolation entering or
+ * exit.
+ *
+ * This is necessary because multiple isolation-breaking events may happen
+ * at once (or one as the result of the other), however isolation exit
+ * may only happen once to transition from isolated to non-isolated state.
+ * Therefore, if decrementing this counter results in a value less than 0,
+ * isolation exit procedure can't be started -- it already happened, or is
+ * in progress, or isolation is not entered yet.
+ */
+DEFINE_PER_CPU(atomic_t, isol_counter);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means the counter should be
+ * decremented back, and the isolation exit that is already in progress should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in the message
+			 * originated on remote CPU, and it did not
+			 * re-enter isolated state since then (warned
+			 * is true). Only local messages following
+			 * remote messages, likely about the same
+			 * isolation breaking event, are skipped to
+			 * avoid duplication. If remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, local cause is skipped
+			 * from messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure, index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	/* At this point breaking isolation from other CPUs is possible again */
+
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	/* Copy flags with task isolation disabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+		}
+		put_cpu();
+	} else {
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Is not isolated already */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+			atomic_dec(&per_cpu(isol_exit_counter, cpu));
+			put_cpu();
+			return;
+		}
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where the
+		 * isolated task is running;
+		 * isol_exit_counter will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		/* Increment counter, this will allow isolation breaking */
+		if (atomic_inc_return(&per_cpu(isol_counter,
+					      smp_processor_id())) > 1) {
+			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+		}
+		atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+
+	/* Copy flags with task isolation enabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+
+	current->task_isolation_state = STATE_ISOLATED;
+	return;
+
+error:
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ *
+ * Messages will be suppressed if there is already a reported remote
+ * cause for isolation breaking, so we don't generate multiple
+ * confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/* Are we exiting isolation already? */
+	if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
+		task->task_isolation_state = STATE_NORMAL;
+		return;
+	}
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		pr_task_isol_err(smp_processor_id(),
+				 "isolation: in NMI; not delivering signal\n");
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (pr_task_isol_warn_supp(smp_processor_id(),
+				   "task_isolation lost due to %s\n", buf))
+		debug_dump_stack();
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void task_isolation_remote(int cpu, const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	va_list args;
+	char buf[200];
+
+	if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
+		return;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	if (pr_task_isol_warn(cpu,
+			      "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+			      buf,
+			      curr_task->comm, curr_task->tgid,
+			      curr_task->pid, smp_processor_id()))
+		debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void task_isolation_remote_cpumask(const struct cpumask *mask,
+				   const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu, first_cpu;
+
+	if (task_isolation_map == NULL ||
+		!zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	first_cpu = -1;
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (task_isolation_on_cpu(cpu)) {
+			if (first_cpu < 0)
+				first_cpu = cpu;
+			else
+				cpumask_set_cpu(cpu, warn_mask);
+		}
+	}
+
+	if (first_cpu < 0)
+		goto done;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (cpumask_weight(warn_mask) == 0)
+		pr_task_isol_warn(first_cpu,
+				  "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	else
+		pr_task_isol_warn(first_cpu,
+				  " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  cpumask_pr_args(warn_mask),
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
+
+/*
+ * Check if the given CPU is running an isolated task.
+ */
+int task_isolation_on_cpu(int cpu)
+{
+	return test_bit(TIF_TASK_ISOLATION,
+			&per_cpu(tsk_thread_flags_cache, cpu));
+}
+
+/*
+ * Set the CPUs currently running isolated tasks in the given CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear the CPUs currently running isolated tasks in the given CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
+
+/*
+ * Check if cleanup is scheduled on the current CPU, and if so, run it.
+ * Intended to be called from notify_resume() or another such callback
+ * on the target CPU.
+ */
+void task_isolation_check_run_cleanup(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 5b2396350dd1..1df57e38c361 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index f9bc5c303e3f..0a4059a8c4f9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2513,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3a609e7344f3..5bb98f39bde6 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a792d21cac64..1d4dec9d3ee7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -882,6 +882,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 04/12] task_isolation: Add task isolation hooks to arch-independent code
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (2 preceding siblings ...)
  2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-03-08  3:48   ` Alex Belits
  2020-03-08  3:49   ` [PATCH v2 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:48 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Chris Metcalf <cmetcalf@mellanox.com>

This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
  local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can be
placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now, we
just update either callers or callees as makes most sense.
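
For reference, the shape shared by all of these hooks is roughly the
sketch below; my_send_ipi() is a made-up placeholder rather than a real
kernel function, and the real call sites (each with its own warning
text) are in the hunks that follow.

  /* Illustration only, not part of the patch. */
  static void my_send_ipi(int cpu)
  {
          /* Warn (with a backtrace) if cpu runs an isolated task... */
          task_isolation_remote(cpu, "example IPI");
          /* ...then deliver the interrupt as before. */
          arch_send_call_function_single_ipi(cpu);
  }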

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/irq/irqdesc.c | 9 +++++++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 98a5f10d1900..e2b81d035fa1 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -670,6 +671,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 		irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
@@ -711,6 +716,10 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 
 	irq = irq_find_mapping(domain, hwirq);
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "NMI irq %d (%s)" : "NMI irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * ack_bad_irq is not NMI-safe, just report
 	 * an invalid interrupt.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 828cc30774bc..8fd4ece43dd8 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,6 +18,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -102,8 +103,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+			task_isolation_remote(cpu, "irq_work");
 			arch_send_call_function_single_ipi(cpu);
+		}
 	} else {
 		__irq_work_queue_local(work);
 	}
diff --git a/kernel/smp.c b/kernel/smp.c
index d0ada39eb4d4..3a8bcbdd4ce6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -176,8 +177,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -466,6 +469,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 05/12] task_isolation: arch/x86: enable task isolation functionality
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (3 preceding siblings ...)
  2020-03-08  3:48   ` [PATCH v2 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-03-08  3:49   ` Alex Belits
  2020-03-08  3:50   ` [PATCH v2 06/12] task_isolation: arch/arm64: " Alex Belits
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:49 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Chris Metcalf <cmetcalf@mellanox.com>

In prepare_exit_to_usermode(), run cleanup for tasks exited from
isolation and call task_isolation_start() for tasks with
TIF_TASK_ISOLATION.

In syscall_trace_enter(), add the necessary support for
reporting syscalls for task-isolation processes.

Add a task_isolation_interrupt() call for the kernel exception types
that do not result in signals, namely non-signalling page faults.
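
For context, the userspace flow that these entry hooks serve looks
roughly like the sketch below. It assumes the PR_TASK_ISOLATION_ENABLE
and PR_TASK_ISOLATION_SET_SIG() definitions from the prctl interface
added earlier in this series, and it is only an illustration, not part
of this patch.

  #include <sys/prctl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          /* Assumed: already pinned to a task-isolation (nohz_full) CPU. */
          if (prctl(PR_TASK_ISOLATION,
                    PR_TASK_ISOLATION_ENABLE |
                    PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0) < 0) {
                  perror("prctl(PR_TASK_ISOLATION)");
                  exit(1);
          }

          /*
           * From here on, any kernel entry other than exit/exit_group or
           * this prctl ends isolation and delivers the chosen signal
           * (SIGKILL if none was configured).
           */
          for (;;)
                  ;       /* latency-critical loop, no syscalls */
  }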

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 15 +++++++++++++++
 arch/x86/include/asm/apic.h        |  3 +++
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/apic/ipi.c         |  2 ++
 arch/x86/mm/fault.c                |  4 ++++
 6 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..9ea6d3e6e77d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9747876980b5..ba8cd75dc7cf 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -26,6 +26,7 @@
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -86,6 +87,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -189,6 +199,8 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	lockdep_assert_irqs_disabled();
 	lockdep_sys_exit();
 
+	task_isolation_check_run_cleanup();
+
 	cached_flags = READ_ONCE(ti->flags);
 
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
@@ -204,6 +216,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 19e94af9cc5d..71149abbb0a0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_APIC_H
 
 #include <linux/cpumask.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
@@ -524,6 +525,7 @@ extern void irq_exit(void);
 
 static inline void entering_irq(void)
 {
+	task_isolation_interrupt("irq");
 	irq_enter();
 	kvm_set_cpu_l1tf_flush_l1d();
 }
@@ -536,6 +538,7 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
+	task_isolation_interrupt("ack irq");
 	irq_enter();
 	ack_APIC_irq();
 	kvm_set_cpu_l1tf_flush_l1d();
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cf4327986e98..60d107f784ee 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -92,6 +92,7 @@ struct thread_info {
 #define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
+#define TIF_TASK_ISOLATION	18	/* task isolation enabled for task */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_NOCPUID		(1 << TIF_NOCPUID)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
@@ -140,7 +142,7 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 6ca0f91372fd..b4dfaad6a440 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -2,6 +2,7 @@
 
 #include <linux/cpumask.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 
 #include "local.h"
 
@@ -67,6 +68,7 @@ void native_smp_send_reschedule(int cpu)
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
+	task_isolation_remote(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa4ea09593ab..2175a8655a7d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/isolation.h>		/* task_isolation_interrupt     */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1483,6 +1484,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 06/12] task_isolation: arch/arm64: enable task isolation functionality
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (4 preceding siblings ...)
  2020-03-08  3:49   ` [PATCH v2 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
@ 2020-03-08  3:50   ` Alex Belits
  2020-03-09 16:59     ` Mark Rutland
  2020-03-08  3:52   ` [PATCH v2 07/12] task_isolation: arch/arm: " Alex Belits
                     ` (6 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:50 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Chris Metcalf <cmetcalf@mellanox.com>

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: simplified to match kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 10 ++++++++++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  7 +++++++
 arch/arm64/mm/fault.c                |  5 +++++
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0b30e884e088..93b6aabc8be9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -129,6 +129,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f0cec4160136..7563098eb5b2 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
@@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index cd6e5fa48b9c..b35b9b0c594c 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return -1;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 339882db5a91..d488c91a4877 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+	_TIF_UPROBE | _TIF_FSCHECK)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_check_run_cleanup();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d4ed9a19d8fe..00f0f77adea0 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -32,6 +32,7 @@
 #include <linux/irq_work.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 #endif
@@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
 	}
 
+	task_isolation_interrupt("IPI type %d (%s)", ipinr,
+				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
 	switch (ipinr) {
 	case IPI_RESCHEDULE:
 		scheduler_ipi();
@@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 85566d32958f..fc4b42c81c4f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
+
 		/*
 		 * Major/minor page fault accounting is only done
 		 * once. If we go through a retry, it is extremely
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 07/12] task_isolation: arch/arm: enable task isolation functionality
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (5 preceding siblings ...)
  2020-03-08  3:50   ` [PATCH v2 06/12] task_isolation: arch/arm64: " Alex Belits
@ 2020-03-08  3:52   ` Alex Belits
  2020-03-08  3:53   ` [PATCH v2 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
                     ` (5 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:52 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Francis Giraldeau <francis.giraldeau@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. Task isolation needs an additional thread flag, which requires
changing the entry assembly code to accept a flag mask wider than one byte.
The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are therefore now loaded
from the literal pool. The rest of the patch is straightforward and reflects
what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [with modifications]
[abelits@marvell.com: modified for kernel 5.6, added isolation cleanup]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++++++---
 arch/arm/kernel/entry-common.S     | 15 ++++++++++-----
 arch/arm/kernel/ptrace.c           | 10 ++++++++++
 arch/arm/kernel/signal.c           | 13 ++++++++++++-
 arch/arm/kernel/smp.c              |  4 ++++
 arch/arm/mm/fault.c                |  8 +++++++-
 7 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 97864aabc2a6..1a66e6c6807c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 0d0d5178e2c3..ec3c2084c391 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation active */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
@@ -153,18 +154,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 271cb8a1eba1..6ceb5cb808a9 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending
 
 
@@ -90,7 +91,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
  UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)
@@ -98,7 +100,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif
 
-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
@@ -131,7 +134,8 @@ ENTRY(ret_to_user_from_irq)
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]
-	tst	r1, #_TIF_WORK_MASK
+	ldr	r2, =_TIF_WORK_MASK
+	tst	r1, r2
 	bne	slow_work_pending
 no_work_pending:
 	asm_trace_hardirqs_on save = 0
@@ -251,7 +255,8 @@ local_restart:
 	ldr	r10, [tsk, #TI_FLAGS]		@ check for syscall tracing
 	stmdb	sp!, {r4, r5}			@ push fifth and sixth args
 
-	tst	r10, #_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	ldr	r11, =_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	tst	r10, r11
 	bne	__sys_trace
 
 	invoke_syscall tbl, scno, r10, __ret_fast_syscall
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index b606cded90cd..a69b0bfd71ae 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -24,6 +24,7 @@
 #include <linux/audit.h>
 #include <linux/tracehook.h>
 #include <linux/unistd.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/traps.h>
@@ -921,6 +922,15 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(scno) == -1)
+			return -1;
+	}
+
 	/* Do seccomp after ptrace; syscall may have changed. */
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 	if (secure_computing() == -1)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ab2568996ddb..29ccef8403cd 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -12,6 +12,7 @@
 #include <linux/tracehook.h>
 #include <linux/uprobes.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/elf.h>
 #include <asm/cacheflush.h>
@@ -639,6 +640,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
 	return 0;
 }
 
+#define WORK_PENDING_LOOP_FLAGS	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |	\
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
 asmlinkage int
 do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 {
@@ -648,6 +652,9 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 	 * Update the trace code with the current status.
 	 */
 	trace_hardirqs_off();
+
+	task_isolation_check_run_cleanup();
+
 	do {
 		if (likely(thread_flags & _TIF_NEED_RESCHED)) {
 			schedule();
@@ -676,7 +683,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 		}
 		local_irq_disable();
 		thread_flags = current_thread_info()->flags;
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	return 0;
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 46e1be9e57a8..95f19b980776 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -26,6 +26,7 @@
 #include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <linux/atomic.h>
 #include <asm/bugs.h>
@@ -560,6 +561,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 
@@ -579,6 +581,7 @@ void arch_irq_work_raise(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
@@ -702,6 +705,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bd0f4821f7e1..acd11a69c4e4 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/system_misc.h>
@@ -332,8 +333,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (6 preceding siblings ...)
  2020-03-08  3:52   ` [PATCH v2 07/12] task_isolation: arch/arm: " Alex Belits
@ 2020-03-08  3:53   ` Alex Belits
  2020-03-08  3:54   ` [PATCH v2 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:53 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desired behavior is to receive the interrupts
generated by tick_nohz_full_kick_cpu(). For hard isolation, however,
this is not desirable because it breaks isolation.

This patch adds a check for it.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1d4dec9d3ee7..fe4503ba1316 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -262,7 +263,7 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (7 preceding siblings ...)
  2020-03-08  3:53   ` [PATCH v2 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-03-08  3:54   ` Alex Belits
  2020-03-08  3:55   ` [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:54 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Yuri Norov <ynorov@marvell.com>

If a CPU runs an isolated task, there is no backlog on it, so we
don't need to flush it. Currently flush_all_backlogs() enqueues the
corresponding work on all CPUs, including the ones that run isolated
tasks. This breaks task isolation for nothing.

In this patch, backlog flushing is enqueued only on non-isolated CPUs.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_on_cpu() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 net/core/dev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c6c985fe7b1b..6d32abb1f06d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -74,6 +74,7 @@
 #include <linux/cpu.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -5518,9 +5519,12 @@ static void flush_all_backlogs(void)
 
 	get_online_cpus();
 
-	for_each_online_cpu(cpu)
+	for_each_online_cpu(cpu) {
+		if (task_isolation_on_cpu(cpu))
+			continue;
 		queue_work_on(cpu, system_highpri_wq,
 			      per_cpu_ptr(&flush_works, cpu));
+	}
 
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(&flush_works, cpu));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (8 preceding siblings ...)
  2020-03-08  3:54   ` [PATCH v2 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
@ 2020-03-08  3:55   ` Alex Belits
  2020-04-06  4:27     ` Kevyn-Alexandre Paré
  2020-03-08  3:56   ` [PATCH v2 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
                     ` (2 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:55 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race between this update and isolation breaking is avoided at the
cost of disabling per-CPU buffer writing for the duration of the update
when it coincides with isolation breaking.
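
Spelled out, the ordering relied on here is: disable recording, order
that against a re-check of the isolation state, and only then update
the pages locally. A minimal restatement (the real code is
update_if_isolated() in the diff below; the writer-side comments are an
interpretation, not code from this patch):

  atomic_inc(&cpu_buffer->record_disabled);
  smp_mb();       /* make the increment visible before the re-check */
  if (task_isolation_on_cpu(cpu)) {
          /*
           * The CPU was still in userspace after writers were disabled,
           * so if it breaks isolation now, its writers will see
           * record_disabled != 0 and leave the buffer alone while the
           * update runs locally.
           */
          rb_update_pages(cpu_buffer);
  }
  atomic_dec(&cpu_buffer->record_disabled);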

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 62 ++++++++++++++++++++++++++++++++++----
 1 file changed, 56 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 61f0e92ace99..593effe40183 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1701,6 +1702,37 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure, update is done, and isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -1784,13 +1816,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						&cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+					&cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -1829,13 +1870,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
+			if (!update_if_isolated(cpu_buffer, cpu_id)) {
+				schedule_work_on(cpu_id,
 					 &cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+				wait_for_completion(&cpu_buffer->update_done);
+			}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (9 preceding siblings ...)
  2020-03-08  3:55   ` [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-03-08  3:56   ` Alex Belits
  2020-03-08  3:57   ` [PATCH v2 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:56 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 3a8bcbdd4ce6..d9b4b2fedfed 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -731,9 +731,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (10 preceding siblings ...)
  2020-03-08  3:56   ` [PATCH v2 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-03-08  3:57   ` Alex Belits
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  3:57 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: mingo, peterz, linux-kernel, Prasun Kapoor, tglx, linux-api,
	catalin.marinas, linux-arm-kernel, netdev, davem, linux-arch,
	will

There are various mechanisms, other than regular workqueue selection,
that select CPUs for jobs. CPU isolation normally does not prevent
those jobs from running on isolated CPUs. When task isolation is
enabled, such jobs should be limited to housekeeping CPUs.
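
As an example of the cpumask_local_spread() change: a typical driver
uses it to spread queue interrupts across CPUs, roughly as in the
sketch below (my_irqs[], nvecs, i and dev are placeholders, not code
from this patch). With this change the returned CPUs come only from the
housekeeping set when CONFIG_TASK_ISOLATION is enabled, so such drivers
stop targeting isolated CPUs without needing any driver changes.

  /* Illustration only. */
  for (i = 0; i < nvecs; i++) {
          unsigned int cpu = cpumask_local_spread(i, dev_to_node(dev));

          irq_set_affinity_hint(my_irqs[i], cpumask_of(cpu));
  }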

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/pci/pci-driver.c |  9 +++++++
 lib/cpumask.c            | 53 +++++++++++++++++++++++++---------------
 net/core/net-sysfs.c     |  9 +++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0454ca0e4e3f..cb872cdd1782 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -332,6 +333,9 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+#endif
 	int error, node, cpu;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
@@ -353,7 +357,12 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
+#ifdef CONFIG_TASK_ISOLATION
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
+#else
 		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+#endif
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..dcbc30a47600 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,28 +206,40 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	const struct cpumask *mask;
+	int cpu, m, n;
+
+#ifdef CONFIG_TASK_ISOLATION
+	mask = housekeeping_cpumask(HK_FLAG_DOMAIN);
+	m = cpumask_weight(mask);
+#else
+	mask = cpu_online_mask;
+	m = num_online_cpus();
+#endif
 
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
-
-	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-	} else {
-		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-
-		for_each_cpu(cpu, cpu_online_mask) {
-			/* Skip NUMA nodes, done above. */
-			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-				continue;
-
-			if (i-- == 0)
-				return cpu;
+	n = i % m;
+
+	while (m-- > 0) {
+		if (node == NUMA_NO_NODE) {
+			for_each_cpu(cpu, mask)
+				if (n-- == 0)
+					return cpu;
+		} else {
+			/* NUMA first. */
+			for_each_cpu_and(cpu, cpumask_of_node(node), mask)
+				if (n-- == 0)
+					return cpu;
+
+			for_each_cpu(cpu, mask) {
+				/* Skip NUMA nodes, done above. */
+				if (cpumask_test_cpu(cpu,
+						     cpumask_of_node(node)))
+					continue;
+
+				if (n-- == 0)
+					return cpu;
+			}
 		}
 	}
 	BUG();
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4c826b8bf9b1..253758f102d9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
@@ -725,6 +726,14 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_and(mask, mask, housekeeping_cpumask(HK_FLAG_DOMAIN));
+	if (cpumask_weight(mask) == 0) {
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+#endif
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 06/12] task_isolation: arch/arm64: enable task isolation functionality
  2020-03-04 16:31   ` Mark Rutland
@ 2020-03-08  4:48     ` Alex Belits
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  4:48 UTC (permalink / raw)
  To: mark.rutland
  Cc: mingo, peterz, rostedt, frederic, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Wed, 2020-03-04 at 16:31 +0000, Mark Rutland wrote:
> Hi Alex,
> 
> For patches affecting arm64, please CC LAKML and the arm64
> maintainers
> (Will and Catalin). I've Cc'd the maintainers here.

Thanks. Added them to Cc:.

> 
> On Wed, Mar 04, 2020 at 04:10:28PM +0000, Alex Belits wrote:
> > From: Chris Metcalf <cmetcalf@mellanox.com>
> > 
> > In do_notify_resume(), call task_isolation_start() for
> > TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to
> > _TIF_WORK_MASK,
> > and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
> > since we don't clear _TIF_TASK_ISOLATION in the loop.
> > 
> > We tweak syscall_trace_enter() slightly to carry the "flags"
> > value from current_thread_info()->flags for each of the tests,
> > rather than doing a volatile read from memory for each one. This
> > avoids a small overhead for each test, and in particular avoids
> > that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.
> 
> Stale commit message?
> 
> Looking at the patch below, this doesn't seem to be the case; it just
> calls test_thread_flag(TIF_TASK_ISOLATION).

Right. I had to revert to a simple check to match the current
implementation of this function.

> 
> > We instrument the smp_send_reschedule() routine so that it checks for
> > isolated tasks and generates a suitable warning if needed.
> > 
> > Finally, report on page faults in task-isolation processes in
> > do_page_faults().
> > 
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> 
> The From line says this was from Chris Metcalf, but he's missing from
> the Sign-off chain here, which isn't right.

I have posted updated patches with properly preserved sign-off lines
and descriptions of updates.

> >  arch/arm64/include/asm/thread_info.h |  5 ++++-
> >  arch/arm64/kernel/ptrace.c           | 10 ++++++++++
> >  arch/arm64/kernel/signal.c           | 13 ++++++++++++-
> >  arch/arm64/kernel/smp.c              |  7 +++++++
> >  arch/arm64/mm/fault.c                |  5 +++++
> >  6 files changed, 39 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 0b30e884e088..93b6aabc8be9 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -129,6 +129,7 @@ config ARM64
> >  	select HAVE_ARCH_PREL32_RELOCATIONS
> >  	select HAVE_ARCH_SECCOMP_FILTER
> >  	select HAVE_ARCH_STACKLEAK
> > +	select HAVE_ARCH_TASK_ISOLATION
> >  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> >  	select HAVE_ARCH_TRACEHOOK
> >  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> > diff --git a/arch/arm64/include/asm/thread_info.h
> > b/arch/arm64/include/asm/thread_info.h
> > index f0cec4160136..7563098eb5b2 100644
> > --- a/arch/arm64/include/asm/thread_info.h
> > +++ b/arch/arm64/include/asm/thread_info.h
> > @@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct
> > *tsk);
> >  #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not
> > current's */
> >  #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep
> > */
> >  #define TIF_FSCHECK		5	/* Check FS is USER_DS on
> > return */
> > +#define TIF_TASK_ISOLATION	6
> >  #define TIF_NOHZ		7
> >  #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
> >  #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
> > @@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct
> > *tsk);
> >  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
> >  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
> >  #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
> > +#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
> >  #define _TIF_NOHZ		(1 << TIF_NOHZ)
> >  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
> >  #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
> > @@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct
> > *tsk);
> >  
> >  #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED |
> > _TIF_SIGPENDING | \
> >  				 _TIF_NOTIFY_RESUME |
> > _TIF_FOREIGN_FPSTATE | \
> > -				 _TIF_UPROBE | _TIF_FSCHECK)
> > +				 _TIF_UPROBE | _TIF_FSCHECK | \
> > +				 _TIF_TASK_ISOLATION)
> >  
> >  #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE |
> > _TIF_SYSCALL_AUDIT | \
> >  				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP
> > | \
> > diff --git a/arch/arm64/kernel/ptrace.c
> > b/arch/arm64/kernel/ptrace.c
> > index cd6e5fa48b9c..b35b9b0c594c 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/regset.h>
> >  #include <linux/tracehook.h>
> >  #include <linux/elf.h>
> > +#include <linux/isolation.h>
> >  
> >  #include <asm/compat.h>
> >  #include <asm/cpufeature.h>
> > @@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs
> > *regs)
> >  			return -1;
> >  	}
> >  
> > +	/*
> > +	 * In task isolation mode, we may prevent the syscall from
> > +	 * running, and if so we also deliver a signal to the process.
> > +	 */
> > +	if (test_thread_flag(TIF_TASK_ISOLATION)) {
> > +		if (task_isolation_syscall(regs->syscallno) == -1)
> > +			return -1;
> > +	}
> 
> As above, this doesn't match the commit message.
> 
> AFAICT, task_isolation_syscall() always returns either 0 or -1, which
> isn't great as an API. I see secure_computing() seems to do the same,
> and it'd be nice to clean that up to either be a real error code or a
> boolean.

A boolean may make more sense, considering that we are not dealing with
errors returned by some operation but with the fact that the calls were
made from the wrong state.

> 

> > +
> >  	/* Do the secure computing after ptrace; failures should be
> > fast. */
> >  	if (secure_computing() == -1)
> >  		return -1;
> > diff --git a/arch/arm64/kernel/signal.c
> > b/arch/arm64/kernel/signal.c
> > index 339882db5a91..d488c91a4877 100644
> > --- a/arch/arm64/kernel/signal.c
> > +++ b/arch/arm64/kernel/signal.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/tracehook.h>
> >  #include <linux/ratelimit.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/isolation.h>
> >  
> >  #include <asm/daifflags.h>
> >  #include <asm/debug-monitors.h>
> > @@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
> >  	restore_saved_sigmask();
> >  }
> >  
> > +#define NOTIFY_RESUME_LOOP_FLAGS \
> > +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> > +	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> > +	_TIF_UPROBE | _TIF_FSCHECK)
> > +
> >  asmlinkage void do_notify_resume(struct pt_regs *regs,
> >  				 unsigned long thread_flags)
> >  {
> > @@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs
> > *regs,
> >  	 */
> >  	trace_hardirqs_off();
> >  
> > +	task_isolation_check_run_cleanup();
> > +
> >  	do {
> >  		/* Check valid user FS if needed */
> >  		addr_limit_user_check();
> > @@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct
> > pt_regs *regs,
> >  
> >  		local_daif_mask();
> >  		thread_flags = READ_ONCE(current_thread_info()->flags);
> > -	} while (thread_flags & _TIF_WORK_MASK);
> > +	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
> > +
> > +	if (thread_flags & _TIF_TASK_ISOLATION)
> > +		task_isolation_start();
> >  }
> >  
> >  unsigned long __ro_after_init signal_minsigstksz;
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index d4ed9a19d8fe..00f0f77adea0 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/irq_work.h>
> >  #include <linux/kexec.h>
> >  #include <linux/kvm_host.h>
> > +#include <linux/isolation.h>
> >  
> >  #include <asm/alternative.h>
> >  #include <asm/atomic.h>
> > @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int
> > cpu)
> >  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
> >  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
> >  {
> > +	task_isolation_remote_cpumask(mask, "wakeup IPI");
> >  	smp_cross_call(mask, IPI_WAKEUP);
> >  }
> >  #endif
> > @@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs
> > *regs)
> >  		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
> >  	}
> >  
> > +	task_isolation_interrupt("IPI type %d (%s)", ipinr,
> > +				 ipinr < NR_IPI ? ipi_types[ipinr] :
> > "unknown");
> > +
> 
> Are these tracing hooks?
> 
> Surely they aren't necessary for functional correctness?

This is necessary to properly break isolation and send a signal when
something disturbs the isolated task, and to tell the user (likely a
developer) why this happened. There are some legitimate reasons for
breaking isolation (for example, a signal that terminates the process)
and a very large number of possible things that should not happen to an
isolated task -- a page fault caused by access to unmapped addresses, a
syscall issued by the task in isolated state, an interrupt directed to
the core running the isolated task, a timer remaining there, or the
kernel trying to run something on that core. It was very helpful to
know what exactly happened; without this reporting I would have a hard
time debugging the drivers and subsystems that called their functions
and scheduled their threads all over the place. So both reliably
breaking isolation and reporting the cause are important.
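
For illustration, the reporting pattern described above boils down to
something like the sketch below (simplified, not the actual code in
kernel/isolation.c; the per-CPU bookkeeping, rate limiting and the
configurable signal number are omitted, and the function name is made
up):

/* Sketch: report the cause and deliver the isolation-breaking signal. */
static void report_isolation_break(const char *cause)
{
	struct task_struct *task = current;

	if (!test_thread_flag(TIF_TASK_ISOLATION))
		return;

	/* Tell the user (likely a developer) what disturbed the task. */
	pr_warn("isolation lost on CPU %d (%s/%d): %s\n",
		smp_processor_id(), task->comm, task->pid, cause);

	/* Deliver the signal; SIGKILL is the default in this series. */
	send_sig(SIGKILL, task, 1);
}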

> 
> >  	switch (ipinr) {
> >  	case IPI_RESCHEDULE:
> >  		scheduler_ipi();
> > @@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs
> > *regs)
> >  
> >  void smp_send_reschedule(int cpu)
> >  {
> > +	task_isolation_remote(cpu, "reschedule IPI");
> >  	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
> >  }
> >  
> >  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
> >  void tick_broadcast(const struct cpumask *mask)
> >  {
> > +	task_isolation_remote_cpumask(mask, "timer IPI");
> >  	smp_cross_call(mask, IPI_TIMER);
> >  }
> >  #endif
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 85566d32958f..fc4b42c81c4f 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/perf_event.h>
> >  #include <linux/preempt.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/isolation.h>
> >  
> >  #include <asm/acpi.h>
> >  #include <asm/bug.h>
> > @@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned
> > long addr, unsigned int esr,
> >  	 */
> >  	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
> >  			      VM_FAULT_BADACCESS)))) {
> > +		/* No signal was generated, but notify task-isolation
> > tasks. */
> > +		if (user_mode(regs))
> > +			task_isolation_interrupt("page fault at %#lx",
> > addr);
> > +
> 
> We check user_mode(regs) much earlier in this function to set
> FAULT_FLAG_USER. Is there some reason this cannot live there?
> 
> Also, this seems to be a tracing hook -- is this necessary?

This is another point where task isolation state is broken: the task
receives a signal and the reason is logged. Without isolation, the task
would not know that anything happened, but for an isolated task it is
important.

In general, if an isolated task is supposed to be running on a CPU
core, it has to stay in userspace. So if for some reason kernel code is
running on that CPU core, this is either the task leaving isolation on
its own, or isolation is broken and the task should be notified about
it. The only problem is how to properly and reliably explain why this
happened.

Thanks!

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-05 18:33   ` Frederic Weisbecker
@ 2020-03-08  5:32     ` Alex Belits
  2020-04-28 14:12     ` Marcelo Tosatti
  1 sibling, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  5:32 UTC (permalink / raw)
  To: frederic
  Cc: mingo, peterz, rostedt, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Thu, 2020-03-05 at 19:33 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> > 
> 
> Hi Alex,
> 
> I'm glad this patchset is being resurected.
> Reading that changelog, I like the general idea and the direction.
> The diff is a bit scary though but I'll check the patches in detail
> in the upcoming days.
> 

I made some updates -- added missing code for arm and x86, restored
sign-off lines and updated commit messages.

This is the result of some work that mostly happened on earlier
versions and had to deal with the fact that timers and housekeeping
work often appeared on all CPUs, so some solutions may look like
overkill. Nevertheless it was very helpful for finding the sources of
unexpected disturbances.

Also, originally some of the race conditions and potential delayed work
at the time when a task is entering isolated state were considered
unavoidable. So the part in the kernel was focused on correctness of
handling those conditions, while detection and dealing with their
consequences was done in userspace (in libtmc). Now it looks like there
may be far fewer such situations, however I am still not very thrilled
with the idea of complicating the kernel more than we have to,
especially when it comes to code that is relevant only for a few
seconds when the task is starting and entering isolated mode. So I have
to admit that some solutions look like "more EINTR than EINTR", and I
still like them more than making the kernel side of entering/exiting
isolation even more complex than it is now.

I may be wrong, and there may be some more elegant solution, however I
don't see it now. The userspace-assisted isolation entering/exiting
procedure worked very well in a system with a huge number of cores,
threads, drivers with unusual features, etc., so at the very least we
have some usable reference point.

> > In a number of cases we can tell on a remote cpu that we are
> > going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> > In that case we generate the diagnostic (and optional stack dump)
> > on the remote core to be able to deliver better diagnostics.
> > If the interrupt is not something caught by Linux (e.g. a
> > hypervisor interrupt) we can also request a reschedule IPI to
> > be sent to the remote core so it can be sure to generate a
> > signal to notify the process.
> 
> I'm wondering if it's wise to run that on a guest at all :-)
> Or we should consider any guest exit to the host as a
> disturbance, we would then need some sort of paravirt
> driver to notify that, etc... That doesn't sound appealing.

Why not? I am not a big fan of virtualization, however people seem to
use it for all kinds of purposes now, and we only have to propagate (or
reject) isolation requests from guest to host (as long as resource and
permission policy allow that). For KVM it would be literally
replicating the guest task isolation state on the host, and as long as
the CPU core is isolated, does it really matter if the task was created
with two layers of virtualization instead of one?

For isolation to make sense, it's still code running on a CPU with a
fixed address mapping. If this is still the case, virtualization only
determines what can be in that space, not how it behaves. If this is
not the case, and the task causes kernel code to run, be it guest or
host kernel, then something is wrong, and isolation is broken. Not very
different from behavior without virtualization.

This would have been very bad in the early days of virtualization, when
very little could be done by a guest without the host messing with it.
Now, when pieces of hardware can be (relatively) safely given to guest
userspace to work on, we can just as well let it run isolated.

> 
> Thanks.

Thanks!

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-06 15:26   ` Frederic Weisbecker
@ 2020-03-08  6:06     ` Alex Belits
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  6:06 UTC (permalink / raw)
  To: frederic
  Cc: mingo, peterz, rostedt, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Fri, 2020-03-06 at 16:26 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> > 		do {
> > +			/* Make sure we are reading up to date values
> > */
> > +			smp_mb();
> > +			ind = ind_counter & 1;
> > +			snprintf(buf_prefix, sizeof(buf_prefix),
> > +				 "isolation %s/%d/%d (cpu %d)",
> > +				 desc->comm[ind], desc->tgid[ind],
> > +				 desc->pid[ind], cpu);
> > +			desc->warned[ind] = true;
> > +			ind_counter_old = ind_counter;
> > +			/* Record the warned flag, then re-read
> > descriptor */
> > +			smp_mb();
> > +			ind_counter = atomic_read(&desc->curr_index);
> > +			/*
> > +			 * If the counter changed, something was
> > updated, so
> > +			 * repeat everything to get the current data
> > +			 */
> > +		} while (ind_counter != ind_counter_old);
> > +	}
> 
> So the need to log the fact we are sending an event to a remote CPU
> that *may be*
> running an isolated task makes things very complicated and even racy.

The only reason why the result of this would be wrong is a race between
multiple causes of breaking isolation of the same task, or a race with
the task exiting isolation on its own at the same time (and possibly
re-entering it, or even another task entering isolation on the same CPU
core). This is possible, however for all practical purposes we are
still logging an isolation-breaking event that happened while a real
isolated task was running. We should keep in mind the possibility that
this isolation-breaking event could be preempted by another
isolation-breaking cause, and all of them will be recorded even if only
one ended up causing fast_task_isolation_cpu_cleanup() to be called on
the target CPU core.
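
To make the retry logic above easier to follow, the writer side of such
a double-buffered descriptor could look roughly like this (the struct
layout is inferred from the reader loop quoted above; the update
function and its details are assumptions for illustration, not the
actual kernel/isolation.c code):

struct isol_desc {
	atomic_t curr_index;	/* low bit selects the live slot */
	char comm[2][TASK_COMM_LEN];
	pid_t tgid[2];
	pid_t pid[2];
	bool warned[2];
};

static void isol_desc_update(struct isol_desc *desc, struct task_struct *tsk)
{
	/* Fill the slot that readers are not currently using. */
	int next = (atomic_read(&desc->curr_index) + 1) & 1;

	strscpy(desc->comm[next], tsk->comm, sizeof(desc->comm[next]));
	desc->tgid[next] = tsk->tgid;
	desc->pid[next] = tsk->pid;
	desc->warned[next] = false;

	/* Publish the new slot; readers that saw a stale index retry. */
	smp_wmb();
	atomic_inc(&desc->curr_index);
}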

> How bad would it be to only log those interruptions once they land on
> the target?
> 

For the purpose of determining the cause of isolation breaking -- very
bad. Early versions of this made people tear their hair out trying to
divine where some IPI came from. Then there was a monstrosity that did
some rather unsafe manipulations with task_struct, however it was only
suitable as a temporary mechanism for development. This version keeps
things consistent and only shows up when there is something that should
be reported.

> Thanks.

Thanks!

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-03-06 15:34   ` Frederic Weisbecker
@ 2020-03-08  6:48     ` Alex Belits
  2020-03-09  2:28       ` Frederic Weisbecker
  0 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-08  6:48 UTC (permalink / raw)
  To: frederic
  Cc: mingo, peterz, rostedt, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Fri, 2020-03-06 at 16:34 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:15:24PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > running
> > isolated tasks.
> > 
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/smp.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 3a8bcbdd4ce6..d9b4b2fedfed 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -731,9 +731,21 @@ static void do_nothing(void *unused)
> >   */
> >  void kick_all_cpus_sync(void)
> >  {
> > +	struct cpumask mask;
> > +
> >  	/* Make sure the change is visible before we kick the cpus */
> >  	smp_mb();
> > -	smp_call_function(do_nothing, NULL, 1);
> > +
> > +	preempt_disable();
> > +#ifdef CONFIG_TASK_ISOLATION
> > +	cpumask_clear(&mask);
> > +	task_isolation_cpumask(&mask);
> > +	cpumask_complement(&mask, &mask);
> > +#else
> > +	cpumask_setall(&mask);
> > +#endif
> > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > +	preempt_enable();
> >  }
> 
> That looks very dangerous, the callers of kick_all_cpus_sync() want
> to
> sync all CPUs for a reason. You will rather need to fix the callers.

All callers use this function to synchronize IPIs and icache, and they
have no idea if there is anything special about the state of the CPUs.
If a task is isolated, this call would not be necessary because the
task is in userspace, and it would have to enter the kernel for any of
that to become relevant, at which point it switches from userspace to
kernel anyway. At worst it is returning to userspace after entering
isolation, or back in the kernel running cleanup after isolation is
broken but before tsk_thread_flags_cache is updated. There will be
nothing to run on the same CPU because we have just left isolation, so
the task will either exit or go back to userspace.

Is there any reason for a race at that point?

> Thanks.
> 
> >  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
> >  
> > -- 
> > 2.20.1
> > 

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-06 16:00   ` Frederic Weisbecker
@ 2020-03-08  7:16     ` Alex Belits
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  7:16 UTC (permalink / raw)
  To: frederic
  Cc: mingo, linux-kernel, peterz, rostedt, Prasun Kapoor, tglx,
	linux-api, linux-mm, linux-arch

On Fri, 2020-03-06 at 17:00 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> > +#ifdef CONFIG_TASK_ISOLATION
> > +int try_stop_full_tick(void)
> > +{
> > +	int cpu = smp_processor_id();
> > +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > +	/* For an unstable clock, we should return a permanent error
> > code. */
> > +	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
> > +		return -EINVAL;
> > +
> > +	if (!can_stop_full_tick(cpu, ts))
> > +		return -EAGAIN;
> 
> Note that the stop_tick naming in nohz can be misleading. It means
> we actually leave the periodic mode and we enter in dynamic tick
> mode.
> 
> In practice it means that the tick is delayed until the next event,
> which
> in the worst case may well be in 1 ms and in the best case never. So
> what
> you probably want to check instead is whether the tick has been
> entirely
> stopped (ie: we called hrtimer_cancel(&ts->sched_timer)).

This is a part of the solution where libtmc in userspace checks for
timers from another core before it confirms that the core entering
isolation can continue. Since, indeed, it is possible that some events
are pending, it is up to userspace to tell the task that it's not
really isolated yet and should exit and re-enter isolation when
everything is done. Or that the wait would be too long, and it should
be treated as an error, reported, etc.

Maybe it would be better if we checked for the timer state and returned
-EAGAIN when it is running at this point, and left it to userspace to
check for those cases when this did not work due to some race or
preemption. However I still want to assume that as long as there is no
complete prohibition on scheduling things on isolated CPUs, there might
be things that will enable this timer at unexpected times while we are
returning to userspace or even immediately after we got into userspace.
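
As a rough illustration of that userspace-assisted approach, the entry
path on the application side can be a simple retry loop around the
prctl() added in patch 03. The constants below are the ones quoted
from include/uapi/linux/prctl.h in this series; error handling is
simplified and it is assumed that a transient failure (such as a still
pending timer) is reported as EAGAIN, as suggested by
try_stop_full_tick() above:

#include <sys/prctl.h>
#include <errno.h>

#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)

static int enter_isolation(void)
{
	for (;;) {
		if (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
			  0, 0, 0) == 0)
			return 0;	/* isolated; stay in userspace now */
		if (errno != EAGAIN)
			return -1;	/* permanent failure, give up */
		/* something is still pending on this core; try again */
	}
}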

> 
> Thanks.
> 
> > +
> > +	tick_nohz_stop_sched_tick(ts, cpu);
> > +	return 0;
> > +}
> > +#endif
> > +
> >  static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> >  {
> >  	/*
> > -- 
> > 2.20.1
> > 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-03-06 16:03   ` Frederic Weisbecker
@ 2020-03-08  7:28     ` Alex Belits
  2020-03-09  2:38       ` Frederic Weisbecker
  0 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-03-08  7:28 UTC (permalink / raw)
  To: frederic
  Cc: mingo, peterz, rostedt, tglx, linux-api, catalin.marinas,
	linux-arm-kernel, netdev, davem, linux-arch, will

On Fri, 2020-03-06 at 17:03 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:12:40PM +0000, Alex Belits wrote:
> > From: Yuri Norov <ynorov@marvell.com>
> > 
> > For nohz_full CPUs the desirable behavior is to receive interrupts
> > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > obviously not desirable because it breaks isolation.
> > 
> > This patch adds check for it.
> > 
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > ---
> >  kernel/time/tick-sched.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 1d4dec9d3ee7..fe4503ba1316 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/clock.h>
> >  #include <linux/sched/stat.h>
> >  #include <linux/sched/nohz.h>
> > +#include <linux/isolation.h>
> >  #include <linux/module.h>
> >  #include <linux/irq_work.h>
> >  #include <linux/posix-timers.h>
> > @@ -262,7 +263,7 @@ static void tick_nohz_full_kick(void)
> >   */
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> >  		return;
> 
> I fear you can't do that. A nohz full CPU is kicked for a reason.
> As for the other cases, you need to fix the callers.
> 
> In the general case, randomly ignoring an interrupt is a correctness
> issue.

Not ignoring, just delaying until we are back from userspace. We know
that everything was done on this CPU when we successfully entered
userspace in isolated mode -- otherwise we would be kicked out. We
restart timers when we are back in the kernel again on cleanup, so
things will be back to normal at that point. Between those moments we
can just as well remain in userspace and forget about the timers until
we are back in the kernel.

> 
> Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel
       [not found]     ` <20200307214254.7a8f6c22@hermes.lan>
@ 2020-03-08  7:33       ` Alex Belits
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-03-08  7:33 UTC (permalink / raw)
  To: stephen
  Cc: mingo, peterz, rostedt, frederic, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Sat, 2020-03-07 at 21:42 -0800, Stephen Hemminger wrote:
> On Sun, 8 Mar 2020 03:47:08 +0000
> Alex Belits <abelits@marvell.com> wrote:
> 
> > +int task_isolation_message(int cpu, int level, bool supp, const
> > char *fmt, ...);
> 
> Since you are doing printf/printk var args you should use the GCC
> format attributes
> like printk.

Agreed.

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-03-08  6:48     ` [EXT] " Alex Belits
@ 2020-03-09  2:28       ` Frederic Weisbecker
  0 siblings, 0 replies; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-09  2:28 UTC (permalink / raw)
  To: Alex Belits
  Cc: mingo, peterz, rostedt, Prasun Kapoor, tglx, linux-api,
	linux-kernel, catalin.marinas, linux-arm-kernel, netdev, davem,
	linux-arch, will

On Sun, Mar 08, 2020 at 06:48:43AM +0000, Alex Belits wrote:
> On Fri, 2020-03-06 at 16:34 +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 04, 2020 at 04:15:24PM +0000, Alex Belits wrote:
> > > From: Yuri Norov <ynorov@marvell.com>
> > > 
> > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > running
> > > isolated tasks.
> > > 
> > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > ---
> > >  kernel/smp.c | 14 +++++++++++++-
> > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > index 3a8bcbdd4ce6..d9b4b2fedfed 100644
> > > --- a/kernel/smp.c
> > > +++ b/kernel/smp.c
> > > @@ -731,9 +731,21 @@ static void do_nothing(void *unused)
> > >   */
> > >  void kick_all_cpus_sync(void)
> > >  {
> > > +	struct cpumask mask;
> > > +
> > >  	/* Make sure the change is visible before we kick the cpus */
> > >  	smp_mb();
> > > -	smp_call_function(do_nothing, NULL, 1);
> > > +
> > > +	preempt_disable();
> > > +#ifdef CONFIG_TASK_ISOLATION
> > > +	cpumask_clear(&mask);
> > > +	task_isolation_cpumask(&mask);
> > > +	cpumask_complement(&mask, &mask);
> > > +#else
> > > +	cpumask_setall(&mask);
> > > +#endif
> > > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > +	preempt_enable();
> > >  }
> > 
> > That looks very dangerous, the callers of kick_all_cpus_sync() want
> > to
> > sync all CPUs for a reason. You will rather need to fix the callers.
> 
> All callers of this use this function to synchronize IPIs and icache,
> and they have no idea if there is anything special about the state of
> CPUs. If a task is isolated, this call would not be necessary because
> the task is in userspace, and it would have to enter kernel for any of
> that to become relevant but then it will have to switch from userspace
> to kernel. At worst it is returning to userspace after entering
> isolation or back in kernel running cleanup after isolation is broken
> but before tsk_thread_flags_cache is updated. There will be nothing to
> run on the same CPU because we have just left isolation, so task will
> either exit or go back to userspace.
> 
> Is there any reason for a race at that point?


I can imagine several races:

1) The isolated task has set the cpumask but hasn't exited the kernel
yet. If it still runs kernel code while kick_all_cpus_sync() has completed,
we fail.

2) The isolated task is running do_exit() but the caller of kick_all_cpus_sync()
still sees the target as part of the isolated mask.

3) The isolated task has just set the isolated cpumask and entered userspace
but the caller still don't see the new value in the isolated cpumask, so it sends
the IPI to the isolated CPU.

Besides, any caller of kick_all_cpus_sync() is within its rights to expect
that everything preceding the call to that function is visible to all CPUs
after that call. If you spare that IPI to an isolated CPU, what ensures
it will see what it is supposed to once it calls do_exit() or prctl()?

Is there a way we could fix the callers instead? For example synchronize_rcu()
could be a replacement (it handles very well nohz_full CPUs), provided the
callsites can sleep. It seems to be the case for __do_tune_cpucache() at least.
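
For callers that are allowed to sleep, the substitution would look
roughly like this (illustrative names only, not the actual slab code):

static int shared_state;

static void publish_new_state(int new_value)
{
	WRITE_ONCE(shared_state, new_value);

	/*
	 * Instead of kick_all_cpus_sync(): after synchronize_rcu()
	 * returns, CPUs that were idle or in userspace (including
	 * nohz_full/isolated ones) will observe the update on their
	 * next kernel entry, without having received an IPI. The
	 * caller must be able to sleep.
	 */
	synchronize_rcu();
}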

flush_icache_range() is scarier I have to admit, doesn't look like it can
sleep.


> > Thanks.
> > 
> > >  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
> > >  
> > > -- 
> > > 2.20.1
> > > 
> 
> -- 
> Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-03-08  7:28     ` [EXT] " Alex Belits
@ 2020-03-09  2:38       ` Frederic Weisbecker
  0 siblings, 0 replies; 71+ messages in thread
From: Frederic Weisbecker @ 2020-03-09  2:38 UTC (permalink / raw)
  To: Alex Belits
  Cc: mingo, peterz, rostedt, tglx, linux-api, catalin.marinas,
	linux-arm-kernel, netdev, davem, linux-arch, will

On Sun, Mar 08, 2020 at 07:28:22AM +0000, Alex Belits wrote:
> On Fri, 2020-03-06 at 17:03 +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 04, 2020 at 04:12:40PM +0000, Alex Belits wrote:
> > > From: Yuri Norov <ynorov@marvell.com>
> > > 
> > > For nohz_full CPUs the desirable behavior is to receive interrupts
> > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > > obviously not desirable because it breaks isolation.
> > > 
> > > This patch adds check for it.
> > > 
> > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > ---
> > >  kernel/time/tick-sched.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index 1d4dec9d3ee7..fe4503ba1316 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -20,6 +20,7 @@
> > >  #include <linux/sched/clock.h>
> > >  #include <linux/sched/stat.h>
> > >  #include <linux/sched/nohz.h>
> > > +#include <linux/isolation.h>
> > >  #include <linux/module.h>
> > >  #include <linux/irq_work.h>
> > >  #include <linux/posix-timers.h>
> > > @@ -262,7 +263,7 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > >  		return;
> > 
> > I fear you can't do that. A nohz full CPU is kicked for a reason.
> > As for the other cases, you need to fix the callers.
> > 
> > In the general case, randomly ignoring an interrupt is a correctness
> > issue.
> 
> Not ignoring, just delaying until we are back from userspace. We know
> that everything was done on this CPU when we successfully entered
> userspace in isolated mode -- otherwise we would be kicked out. We
> restart timers when we are back in kernel again on cleanup, so things
> will be back to normal at that point. Between those moments we can just
> as well remain in userspace and forget about the timers until we are
> back in kernel.

Well, if another CPU requests the tick on our isolated CPU, we can't ignore
it. This can be a posix cpu timer belonging to our process, a timer bound
to our CPU or tasks added to our CPU that require the scheduler tick.
Denying any of that can crash the kernel randomly.

The only thing we can do is to simply avoid these situations. But those
are requirements anyway if you want to run a task undisturbed.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/12] task_isolation: arch/arm64: enable task isolation functionality
  2020-03-08  3:50   ` [PATCH v2 06/12] task_isolation: arch/arm64: " Alex Belits
@ 2020-03-09 16:59     ` Mark Rutland
  0 siblings, 0 replies; 71+ messages in thread
From: Mark Rutland @ 2020-03-09 16:59 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, catalin.marinas, linux-arm-kernel, netdev,
	davem, linux-arch, will

On Sun, Mar 08, 2020 at 03:50:58AM +0000, Alex Belits wrote:
> From: Chris Metcalf <cmetcalf@mellanox.com>
> 
> In do_notify_resume(), call task_isolation_start() for
> TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
> and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
> since we don't clear _TIF_TASK_ISOLATION in the loop.
> 
> We instrument the smp_send_reschedule() routine so that it checks for
> isolated tasks and generates a suitable warning if needed.
> 
> Finally, report on page faults in task-isolation processes in
> do_page_faults().
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> [abelits@marvell.com: simplified to match kernel 5.6]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/thread_info.h |  5 ++++-
>  arch/arm64/kernel/ptrace.c           | 10 ++++++++++
>  arch/arm64/kernel/signal.c           | 13 ++++++++++++-
>  arch/arm64/kernel/smp.c              |  7 +++++++
>  arch/arm64/mm/fault.c                |  5 +++++
>  6 files changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0b30e884e088..93b6aabc8be9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -129,6 +129,7 @@ config ARM64
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
>  	select HAVE_ARCH_STACKLEAK
> +	select HAVE_ARCH_TASK_ISOLATION
>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_TRACEHOOK
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index f0cec4160136..7563098eb5b2 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
>  #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
>  #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
> +#define TIF_TASK_ISOLATION	6
>  #define TIF_NOHZ		7
>  #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
>  #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
> @@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
> +#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
>  #define _TIF_NOHZ		(1 << TIF_NOHZ)
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
> @@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
>  
>  #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
>  				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> -				 _TIF_UPROBE | _TIF_FSCHECK)
> +				 _TIF_UPROBE | _TIF_FSCHECK | \
> +				 _TIF_TASK_ISOLATION)
>  
>  #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
>  				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index cd6e5fa48b9c..b35b9b0c594c 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -29,6 +29,7 @@
>  #include <linux/regset.h>
>  #include <linux/tracehook.h>
>  #include <linux/elf.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/compat.h>
>  #include <asm/cpufeature.h>
> @@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
>  			return -1;
>  	}
>  
> +	/*
> +	 * In task isolation mode, we may prevent the syscall from
> +	 * running, and if so we also deliver a signal to the process.
> +	 */
> +	if (test_thread_flag(TIF_TASK_ISOLATION)) {
> +		if (task_isolation_syscall(regs->syscallno) == -1)

Please use NO_SYSCALL rather than -1 here.

> +			return -1;
> +	}
> +
>  	/* Do the secure computing after ptrace; failures should be fast. */
>  	if (secure_computing() == -1)
>  		return -1;
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 339882db5a91..d488c91a4877 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -20,6 +20,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/ratelimit.h>
>  #include <linux/syscalls.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/daifflags.h>
>  #include <asm/debug-monitors.h>
> @@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> +#define NOTIFY_RESUME_LOOP_FLAGS \
> +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> +	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> +	_TIF_UPROBE | _TIF_FSCHECK)
> +
>  asmlinkage void do_notify_resume(struct pt_regs *regs,
>  				 unsigned long thread_flags)
>  {
> @@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  	 */
>  	trace_hardirqs_off();
>  
> +	task_isolation_check_run_cleanup();
> +
>  	do {
>  		/* Check valid user FS if needed */
>  		addr_limit_user_check();
> @@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  
>  		local_daif_mask();
>  		thread_flags = READ_ONCE(current_thread_info()->flags);
> -	} while (thread_flags & _TIF_WORK_MASK);
> +	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
> +
> +	if (thread_flags & _TIF_TASK_ISOLATION)
> +		task_isolation_start();
>  }
>  
>  unsigned long __ro_after_init signal_minsigstksz;
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index d4ed9a19d8fe..00f0f77adea0 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -32,6 +32,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/kexec.h>
>  #include <linux/kvm_host.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/alternative.h>
>  #include <asm/atomic.h>
> @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
>  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "wakeup IPI");
>  	smp_cross_call(mask, IPI_WAKEUP);
>  }
>  #endif
> @@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
>  	}
>  
> +	task_isolation_interrupt("IPI type %d (%s)", ipinr,
> +				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");

When I previously asked about tracing, I was asking about the format
strings, since we don't bother with that kind of thing elsewhere.

What exactly are these hooks used for? I assume the strings are only
there as a debugging aid?

What about other IRQs? Do we need something in the irqchip driver?

If we need to track that /any/ interrupt was received, I think that
would be better to put in the top-level interrupt exception handler than
to sprinkle hooks into every potential handler.

> +
>  	switch (ipinr) {
>  	case IPI_RESCHEDULE:
>  		scheduler_ipi();
> @@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  
>  void smp_send_reschedule(int cpu)
>  {
> +	task_isolation_remote(cpu, "reschedule IPI");
>  	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
>  }
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>  void tick_broadcast(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "timer IPI");
>  	smp_cross_call(mask, IPI_TIMER);
>  }
>  #endif
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 85566d32958f..fc4b42c81c4f 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -23,6 +23,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/preempt.h>
>  #include <linux/hugetlb.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/acpi.h>
>  #include <asm/bug.h>
> @@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  	 */
>  	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
>  			      VM_FAULT_BADACCESS)))) {
> +		/* No signal was generated, but notify task-isolation tasks. */
> +		if (user_mode(regs))
> +			task_isolation_interrupt("page fault at %#lx", addr);

This isn't an interrupt. Why do we need to hook this?

What about /other/ exceptions caused by userspace?

If we need to notify userspace, it would be much more reliable to do so
in the return path.

Thanks,
Mark.

> +
>  		/*
>  		 * Major/minor page fault accounting is only done
>  		 * once. If we go through a retry, it is extremely
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
       [not found]     ` <20200307214254.7a8f6c22@hermes.lan>
@ 2020-03-27  8:42     ` Marta Rybczynska
  2020-04-06  4:31     ` Kevyn-Alexandre Paré
  2020-04-06  4:43     ` Kevyn-Alexandre Paré
  3 siblings, 0 replies; 71+ messages in thread
From: Marta Rybczynska @ 2020-03-27  8:42 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, catalin.marinas, linux-arm-kernel, netdev,
	davem, linux-arch, will

On Sun, Mar 8, 2020 at 4:48 AM Alex Belits <abelits@marvell.com> wrote:
> +/* Enable task_isolation mode for TASK_ISOLATION kernels. */
> +#define PR_TASK_ISOLATION              48
> +# define PR_TASK_ISOLATION_ENABLE      (1 << 0)
> +# define PR_TASK_ISOLATION_SET_SIG(sig)        (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
> +
Thank you for resurrecting this code!

I have a question on the UAPI: the example code is using
PR_TASK_ISOLATION_USERSIG and it seems to be removed from this
version.

To enable isolation with SIGUSR1 the task should run:
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE
    | PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

And to disable:
prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);

Is this correct?
Marta

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-03-08  3:55   ` [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-04-06  4:27     ` Kevyn-Alexandre Paré
  0 siblings, 0 replies; 71+ messages in thread
From: Kevyn-Alexandre Paré @ 2020-04-06  4:27 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, catalin.marinas, linux-arm-kernel, netdev,
	davem, linux-arch, will

On Sun, Mar 08, 2020 at 03:55:24AM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
> 
> CPUs running isolated tasks are in userspace, so they don't have to
> perform ring buffer updates immediately. If ring_buffer_resize()
> schedules the update on those CPUs, isolation is broken. To prevent
> that, updates for CPUs running isolated tasks are performed locally,
> like for offline CPUs.
> 
> A race condition between this update and isolation breaking is avoided
> at the cost of disabling per_cpu buffer writing for the time of update
> when it coincides with isolation breaking.
> 
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: updated to prevent race with isolation breaking]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/trace/ring_buffer.c | 62 ++++++++++++++++++++++++++++++++++----
>  1 file changed, 56 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 61f0e92ace99..593effe40183 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -21,6 +21,7 @@
>  #include <linux/delay.h>
>  #include <linux/slab.h>
>  #include <linux/init.h>
> +#include <linux/isolation.h>
>  #include <linux/hash.h>
>  #include <linux/list.h>
>  #include <linux/cpu.h>
> @@ -1701,6 +1702,37 @@ static void update_pages_handler(struct work_struct *work)
>  	complete(&cpu_buffer->update_done);
>  }
>  
> +static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
> +			       int cpu)
> +{
> +	bool rv = false;
> +
> +	if (task_isolation_on_cpu(cpu)) {
> +		/*
> +		 * CPU is running isolated task. Since it may lose
> +		 * isolation and re-enter kernel simultaneously with
> +		 * this update, disable recording until it's done.
> +		 */
> +		atomic_inc(&cpu_buffer->record_disabled);
> +		/* Make sure, update is done, and isolation state is current */
> +		smp_mb();
> +		if (task_isolation_on_cpu(cpu)) {
> +			/*
> +			 * If CPU is still running isolated task, we
> +			 * can be sure that breaking isolation will
> +			 * happen while recording is disabled, and CPU
> +			 * will not touch this buffer until the update
> +			 * is done.
> +			 */
> +			rb_update_pages(cpu_buffer);
> +			cpu_buffer->nr_pages_to_update = 0;
> +			rv = true;
> +		}
> +		atomic_dec(&cpu_buffer->record_disabled);
> +	}
> +	return rv;
> +}
> +
>  /**
>   * ring_buffer_resize - resize the ring buffer
>   * @buffer: the buffer to resize.
> @@ -1784,13 +1816,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
>  			if (!cpu_buffer->nr_pages_to_update)
>  				continue;
>  
> -			/* Can't run something on an offline CPU. */
> +			/*
> +			 * Can't run something on an offline CPU.
> +			 *
> +			 * CPUs running isolated tasks don't have to
> +			 * update ring buffers until they exit
> +			 * isolation because they are in
> +			 * userspace. Use the procedure that prevents
> +			 * race condition with isolation breaking.
> +			 */
>  			if (!cpu_online(cpu)) {
>  				rb_update_pages(cpu_buffer);
>  				cpu_buffer->nr_pages_to_update = 0;
>  			} else {
> -				schedule_work_on(cpu,
> -						&cpu_buffer->update_pages_work);
> +				if (!update_if_isolated(cpu_buffer, cpu))
> +					schedule_work_on(cpu,
> +					&cpu_buffer->update_pages_work);
>  			}
>  		}
>  
> @@ -1829,13 +1870,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
>  
>  		get_online_cpus();
>  
> -		/* Can't run something on an offline CPU. */
> +		/*
> +		 * Can't run something on an offline CPU.
> +		 *
> +		 * CPUs running isolated tasks don't have to update
> +		 * ring buffers until they exit isolation because they
> +		 * are in userspace. Use the procedure that prevents
> +		 * race condition with isolation breaking.
> +		 */
>  		if (!cpu_online(cpu_id))
>  			rb_update_pages(cpu_buffer);
>  		else {
> -			schedule_work_on(cpu_id,
> +			if (!update_if_isolated(cpu_buffer, cpu_id))
> +				schedule_work_on(cpu_id,
>  					 &cpu_buffer->update_pages_work);
> -			wait_for_completion(&cpu_buffer->update_done);
> +				wait_for_completion(&cpu_buffer->update_done);
> +			}
>  		}
>  
>  		cpu_buffer->nr_pages_to_update = 0;

gcc output:

kernel/trace/ring_buffer.c: In function 'ring_buffer_resize':
kernel/trace/ring_buffer.c:1884:4: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
    if (!update_if_isolated(cpu_buffer, cpu_id))
    ^~
kernel/trace/ring_buffer.c:1887:5: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
     wait_for_completion(&cpu_buffer->update_done);
     ^~~~~~~~~~~~~~~~~~~
kernel/trace/ring_buffer.c:1858:4: error: label 'out' used but not defined
    goto out;
    ^~~~
kernel/trace/ring_buffer.c:1868:4: error: label 'out_err' used but not defined
    goto out_err;
    ^~~~

My fix:

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 593effe40183..8b458400ac31 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1881,9 +1881,8 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
                if (!cpu_online(cpu_id))
                        rb_update_pages(cpu_buffer);
                else {
-                       if (!update_if_isolated(cpu_buffer, cpu_id))
-                               schedule_work_on(cpu_id,
-                                        &cpu_buffer->update_pages_work);
+                       if (!update_if_isolated(cpu_buffer, cpu_id)) {
+                               schedule_work_on(cpu_id, &cpu_buffer->update_pages_work);
                                wait_for_completion(&cpu_buffer->update_done);
                        }
                }


thx,

-- Kevyn-Alexandre Paré 

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
       [not found]     ` <20200307214254.7a8f6c22@hermes.lan>
  2020-03-27  8:42     ` Marta Rybczynska
@ 2020-04-06  4:31     ` Kevyn-Alexandre Paré
  2020-04-06  4:43     ` Kevyn-Alexandre Paré
  3 siblings, 0 replies; 71+ messages in thread
From: Kevyn-Alexandre Paré @ 2020-04-06  4:31 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, catalin.marinas, linux-arm-kernel, netdev,
	davem, linux-arch, will

On Sun, Mar 08, 2020 at 03:47:08AM +0000, Alex Belits wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
> 
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
> 
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting. Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
> 
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
> 
> Separate patches that follow provide these changes for x86, arm,
> and arm64.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  include/linux/hrtimer.h                       |   4 +
>  include/linux/isolation.h                     | 229 ++++++
>  include/linux/sched.h                         |   4 +
>  include/linux/tick.h                          |   3 +
>  include/uapi/linux/prctl.h                    |   6 +
>  init/Kconfig                                  |  28 +
>  kernel/Makefile                               |   2 +
>  kernel/context_tracking.c                     |   2 +
>  kernel/isolation.c                            | 774 ++++++++++++++++++
>  kernel/signal.c                               |   2 +
>  kernel/sys.c                                  |   6 +
>  kernel/time/hrtimer.c                         |  27 +
>  kernel/time/tick-sched.c                      |  18 +
>  14 files changed, 1111 insertions(+)
>  create mode 100644 include/linux/isolation.h
>  create mode 100644 kernel/isolation.c
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c07815d230bc..e4a2d6e37645 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4808,6 +4808,12 @@
>  			neutralize any effect of /proc/sys/kernel/sysrq.
>  			Useful for debugging.
>  
> +	task_isolation_debug	[KNL]
> +			In kernels built with CONFIG_TASK_ISOLATION, this
> +			setting will generate console backtraces to
> +			accompany the diagnostics generated about
> +			interrupting tasks running with task isolation.
> +
>  	tcpmhash_entries= [KNL,NET]
>  			Set the number of tcp_metrics_hash slots.
>  			Default value is 8192 or 16384 depending on total
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 15c8ac313678..e81252eb4f92 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
>  /* Show pending timers: */
>  extern void sysrq_timer_list_show(void);
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +extern void kick_hrtimer(void);
> +#endif
> +
>  int hrtimers_prepare_cpu(unsigned int cpu);
>  #ifdef CONFIG_HOTPLUG_CPU
>  int hrtimers_dead_cpu(unsigned int cpu);
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> new file mode 100644
> index 000000000000..6bd71c67f10f
> --- /dev/null
> +++ b/include/linux/isolation.h
> @@ -0,0 +1,229 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Task isolation support
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@mellanox.com>
> + *   Alex Belits <abelits@marvell.com>
> + *   Yuri Norov <ynorov@marvell.com>
> + */
> +#ifndef _LINUX_ISOLATION_H
> +#define _LINUX_ISOLATION_H
> +
> +#include <stdarg.h>
> +#include <linux/errno.h>
> +#include <linux/cpumask.h>
> +#include <linux/prctl.h>
> +#include <linux/types.h>
> +
> +struct task_struct;
> +
> +#ifdef CONFIG_TASK_ISOLATION
> +
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
> +
> +#define pr_task_isol_emerg(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err(cpu, fmt, ...)				\
> +	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
> +
> +#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err_supp(cpu, fmt, ...)				\
> +	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
> +DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);

gcc output:

In file included from ./arch/x86/include/asm/apic.h:6,
                 from arch/x86/kernel/apic/apic_noop.c:14:
./include/linux/isolation.h:58:32: error: unknown type name 'tsk_thread_flags_copy'
 DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
                                ^~~~~~~~~~~~~~~~~~~~~

My fix:

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 6bd71c67f10f..a392abed304b 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -55,7 +55,7 @@ int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
        task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
 #define pr_task_isol_debug_supp(cpu, fmt, ...)                 \
        task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
-DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
+//DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
 extern cpumask_var_t task_isolation_map;
 
 /**

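An aside on that same line: the declared name also does not match what
kernel/isolation.c actually defines -- the .c file has
DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache) and only uses it from
the out-of-line task_isolation_on_cpu(), so nothing outside isolation.c seems
to need the declaration at all. If the declaration is meant to stay visible,
a minimal alternative sketch (assuming tsk_thread_flags_cache is the intended
name) would be to rename it instead of commenting it out:

-DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
+DECLARE_PER_CPU(unsigned long, tsk_thread_flags_cache);
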
> +extern cpumask_var_t task_isolation_map;
> +
> +/**
> + * task_isolation_request() - prctl hook to request task isolation
> + * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
> + *
> + * This is called from the generic prctl() code for PR_TASK_ISOLATION.
> + *
> + * Return: 0 when task isolation is enabled, otherwise a negative
> + * errno.
> + */
> +extern int task_isolation_request(unsigned int flags);
> +extern void task_isolation_cpu_cleanup(void);
> +/**
> + * task_isolation_start() - attempt to actually start task isolation
> + *
> + * This function should be invoked as the last thing prior to returning to
> + * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
> + * will attempt to quiesce the core and enter task-isolation mode.  If it
> + * fails, it will reset the system call return value to an error code that
> + * indicates the failure mode.
> + */
> +extern void task_isolation_start(void);
> +
> +/**
> + * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
> + * @cpu:	CPU to check.
> + */
> +static inline bool is_isolation_cpu(int cpu)
> +{
> +	return task_isolation_map != NULL &&
> +		cpumask_test_cpu(cpu, task_isolation_map);
> +}
> +
> +/**
> + * task_isolation_on_cpu() - check if the cpu is running isolated task
> + * @cpu:	CPU to check.
> + */
> +extern int task_isolation_on_cpu(int cpu);
> +extern void task_isolation_check_run_cleanup(void);
> +
> +/**
> + * task_isolation_cpumask() - set CPUs currently running isolated tasks
> + * @mask:	Mask to modify.
> + */
> +extern void task_isolation_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
> + * @mask:      Mask to modify.
> + */
> +extern void task_isolation_clear_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_syscall() - report a syscall from an isolated task
> + * @nr:		The syscall number.
> + *
> + * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
> + * set in the thread_info flags.  It checks for valid syscalls,
> + * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
> + * For any other syscall it will raise a signal and return failure.
> + *
> + * Return: 0 for acceptable syscalls, -1 for all others.
> + */
> +extern int task_isolation_syscall(int nr);
> +
> +/**
> + * _task_isolation_interrupt() - report an interrupt of an isolated task
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This routine should be invoked at any exception or IRQ if
> + * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
> + * to invoke it if the exception will generate a signal anyway (e.g. a bad
> + * page fault), and in that case it is preferable not to invoke it but just
> + * rely on the standard Linux signal.  The macro task_isolation_interrupt()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_interrupt(const char *fmt, ...);
> +#define task_isolation_interrupt(fmt, ...)				\
> +	do {								\
> +		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
> +			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
> +	} while (0)
> +
> +/**
> + * task_isolation_remote() - report a remote interrupt of an isolated task
> + * @cpu:	The remote cpu that is about to be interrupted.
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This routine should be invoked any time a remote IPI or other type of
> + * interrupt is being delivered to another cpu. The function will check to
> + * see if the target core is running a task-isolation task, and generate a
> + * diagnostic on the console if so; in addition, we tag the task so it
> + * doesn't generate another diagnostic when the interrupt actually arrives.
> + * Generating a diagnostic remotely yields a clearer indication of what
> + * happened than just reporting only when the remote core is interrupted.
> + *
> + */
> +extern void task_isolation_remote(int cpu, const char *fmt, ...);
> +
> +/**
> + * task_isolation_remote_cpumask() - report interruption of multiple cpus
> + * @mask:	The set of remote cpus that are about to be interrupted.
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This is the cpumask variant of _task_isolation_remote().  We
> + * generate a single-line diagnostic message even if multiple remote
> + * task-isolation cpus are being interrupted.
> + */
> +extern void task_isolation_remote_cpumask(const struct cpumask *mask,
> +					  const char *fmt, ...);
> +
> +/**
> + * _task_isolation_signal() - disable task isolation when signal is pending
> + * @task:	The task for which to disable isolation.
> + *
> + * This function generates a diagnostic and disables task isolation; it
> + * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
> + * pending signal.  The task_isolation_interrupt() function normally
> + * generates a diagnostic for events that just interrupt a task without
> + * generating a signal; here we need to hook the paths that correspond to
> + * interrupts that do generate a signal.  The macro task_isolation_signal()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_signal(struct task_struct *task);
> +#define task_isolation_signal(task)					\
> +	do {								\
> +		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
> +			_task_isolation_signal(task);			\
> +	} while (0)
> +
> +/**
> + * task_isolation_user_exit() - debug all user_exit calls
> + *
> + * By default, we don't generate an exception in the low-level user_exit()
> + * code, because programs lose the ability to disable task isolation: the
> + * user_exit() hook will cause a signal prior to task_isolation_syscall()
> + * disabling task isolation.  In addition, it means that we lose all the
> + * diagnostic info otherwise available from task_isolation_interrupt() hooks
> + * later in the interrupt-handling process.  But you may enable it here for
> + * a special kernel build if you are having undiagnosed userspace jitter.
> + */
> +static inline void task_isolation_user_exit(void)
> +{
> +#ifdef DEBUG_TASK_ISOLATION
> +	task_isolation_interrupt("user_exit");
> +#endif
> +}
> +
> +#else /* !CONFIG_TASK_ISOLATION */
> +static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
> +static inline void task_isolation_start(void) { }
> +static inline bool is_isolation_cpu(int cpu) { return 0; }
> +static inline int task_isolation_on_cpu(int cpu) { return 0; }
> +static inline void task_isolation_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_cpu_cleanup(void) { }
> +static inline void task_isolation_check_run_cleanup(void) { }
> +static inline int task_isolation_syscall(int nr) { return 0; }
> +static inline void task_isolation_interrupt(const char *fmt, ...) { }
> +static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
> +static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
> +						 const char *fmt, ...) { }
> +static inline void task_isolation_signal(struct task_struct *task) { }
> +static inline void task_isolation_user_exit(void) { }
> +#endif
> +
> +#endif /* _LINUX_ISOLATION_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..52fdb32aa3b9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1280,6 +1280,10 @@ struct task_struct {
>  	unsigned long			lowest_stack;
>  	unsigned long			prev_lowest_stack;
>  #endif
> +#ifdef CONFIG_TASK_ISOLATION
> +	unsigned short			task_isolation_flags;  /* prctl */
> +	unsigned short			task_isolation_state;
> +#endif
>  
>  	/*
>  	 * New fields for task_struct should be added above here, so that
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7340613c7eff..27c7c033d5a8 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void __tick_nohz_task_switch(void);
>  extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
> +#ifdef CONFIG_TASK_ISOLATION
> +extern int try_stop_full_tick(void);
> +#endif
>  #else
>  static inline bool tick_nohz_full_enabled(void) { return false; }
>  static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..f4848ed2a069 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,10 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER		57
>  #define PR_GET_IO_FLUSHER		58
>  
> +/* Enable task_isolation mode for TASK_ISOLATION kernels. */
> +#define PR_TASK_ISOLATION		48
> +# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
> +# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 20a6ac33761c..ecdf567f6bd4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -576,6 +576,34 @@ config CPU_ISOLATION
>  
>  source "kernel/rcu/Kconfig"
>  
> +config HAVE_ARCH_TASK_ISOLATION
> +	bool
> +
> +config TASK_ISOLATION
> +	bool "Provide hard CPU isolation from the kernel on demand"
> +	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
> +	help
> +
> +	Allow userspace processes that place themselves on cores with
> +	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
> +	to "isolate" themselves from the kernel.  Prior to returning to
> +	userspace, isolated tasks will arrange that no future kernel
> +	activity will interrupt the task while the task is running in
> +	userspace.  Attempting to re-enter the kernel while in this mode
> +	will cause the task to be terminated with a signal; you must
> +	explicitly use prctl() to disable task isolation before resuming
> +	normal use of the kernel.
> +
> +	This "hard" isolation from the kernel is required for userspace
> +	tasks that run hard real-time workloads, such as a high-speed
> +	network driver in userspace.  Without this option, but
> +	with NO_HZ_FULL enabled, the kernel will make a best-effort, "soft"
> +	attempt to shield a single userspace process from interrupts, but
> +	makes no guarantees.
> +
> +	You should say "N" unless you are intending to run a
> +	high-performance userspace driver or similar task.
> +
>  config BUILD_BIN2C
>  	bool
>  	default n
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb4130ced32..2f2ae91f90d5 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
>  KASAN_SANITIZE_stackleak.o := n
>  KCOV_INSTRUMENT_stackleak.o := n
>  
> +obj-$(CONFIG_TASK_ISOLATION) += isolation.o
> +
>  $(obj)/configs.o: $(obj)/config_data.gz
>  
>  targets += config_data.gz
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0296b4bda8f1..e9206736f219 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -21,6 +21,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/export.h>
>  #include <linux/kprobes.h>
> +#include <linux/isolation.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
> @@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
>  			if (state == CONTEXT_USER) {
>  				vtime_user_exit(current);
>  				trace_user_exit(0);
> +				task_isolation_user_exit();
>  			}
>  		}
>  		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..ae29732c376c
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,774 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation of task isolation.
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@mellanox.com>
> + *   Alex Belits <abelits@marvell.com>
> + *   Yuri Norov <ynorov@marvell.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/sched.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include <linux/smp.h>
> +#include <linux/tick.h>
> +#include <asm/unistd.h>
> +#include <asm/syscall.h>
> +#include <linux/hrtimer.h>
> +
> +/*
> + * These values are stored in task_isolation_state.
> + * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
> + * returning from sys_prctl() to userspace.
> + */
> +enum {
> +	STATE_NORMAL = 0,	/* Not isolated */
> +	STATE_ISOLATED = 1	/* In userspace, isolated */
> +};
> +
> +/*
> + * This variable contains thread flags copied at the moment
> + * when schedule() switched to the task on a given CPU,
> + * or 0 if no task is running.
> + */
> +DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
> +
> +/*
> + * Counter for isolation state on a given CPU, increments when entering
> + * isolation and decrements when exiting isolation (before or after the
> + * cleanup). Multiple simultaneously running procedures entering or
> + * exiting isolation are prevented by checking the result of
> + * incrementing or decrementing this variable. This variable is both
> + * incremented and decremented by the CPU that caused the isolation entry
> + * or exit.
> + *
> + * This is necessary because multiple isolation-breaking events may happen
> + * at once (or one as the result of the other), however isolation exit
> + * may only happen once to transition from isolated to non-isolated state.
> + * Therefore, if decrementing this counter results in a value less than 0,
> + * isolation exit procedure can't be started -- it already happened, or is
> + * in progress, or isolation is not entered yet.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_counter);
> +
> +/*
> + * Description of the last two tasks that ran isolated on a given CPU.
> + * This is intended only for messages about isolation breaking. We
> + * don't want any references to the actual task while accessing this from
> + * the CPU that caused isolation breaking -- we know nothing about timing
> + * and don't want to use locking or RCU.
> + */
> +struct isol_task_desc {
> +	atomic_t curr_index;
> +	atomic_t curr_index_wr;
> +	bool	warned[2];
> +	pid_t	pid[2];
> +	pid_t	tgid[2];
> +	char	comm[2][TASK_COMM_LEN];
> +};
> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> +
> +/*
> + * Counter for isolation exiting procedures (from request to the start of
> + * cleanup) being attempted at once on a CPU. Normally incrementing of
> + * this counter is performed from the CPU that caused isolation breaking,
> + * however decrementing is done from the cleanup procedure, delegated to
> + * the CPU that is exiting isolation, not from the CPU that caused isolation
> + * breaking.
> + *
> + * If incrementing this counter while starting the isolation exit procedure
> + * results in a value greater than 0, isolation exit is already in progress
> + * and cleanup has not started yet. This means the counter should be
> + * decremented back, and the isolation exit that is already in progress
> + * should be allowed to complete. Otherwise, a new isolation exit
> + * procedure should be started.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_exit_counter);
> +
> +/*
> + * Descriptor for isolation-breaking SMP calls
> + */
> +DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
> +
> +cpumask_var_t task_isolation_map;
> +cpumask_var_t task_isolation_cleanup_map;
> +static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
> +
> +/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
> +static int __init task_isolation_init(void)
> +{
> +	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
> +	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
> +		/*
> +		 * At this point task isolation should match
> +		 * nohz_full. This may change in the future.
> +		 */
> +		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
> +	return 0;
> +}
> +core_initcall(task_isolation_init)
> +
> +/* Enable stack backtraces of any interrupts of task_isolation cores. */
> +static bool task_isolation_debug;
> +static int __init task_isolation_debug_func(char *str)
> +{
> +	task_isolation_debug = true;
> +	return 1;
> +}
> +__setup("task_isolation_debug", task_isolation_debug_func);
> +
> +/*
> + * Record name, pid and group pid of the task entering isolation on
> + * the current CPU.
> + */
> +static void record_curr_isolated_task(void)
> +{
> +	int ind;
> +	int cpu = smp_processor_id();
> +	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
> +	struct task_struct *task = current;
> +
> +	/* Finish everything before recording current task */
> +	smp_mb();
> +	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
> +	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
> +	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
> +	desc->pid[ind] = task->pid;
> +	desc->tgid[ind] = task->tgid;
> +	desc->warned[ind] = false;
> +	/* Write everything, to be seen by other CPUs */
> +	smp_mb();
> +	atomic_inc(&desc->curr_index);
> +	/* Everyone will see the new record from this point */
> +	smp_mb();
> +}
> +
> +/*
> + * Print message prefixed with the description of the current (or
> + * last) isolated task on a given CPU. Intended for isolation breaking
> + * messages that include target task for the user's convenience.
> + *
> + * Messages produced with this function may have obsolete task
> + * information if isolated tasks managed to exit, start and enter
> + * isolation multiple times, or multiple tasks tried to enter
> + * isolation on the same CPU at once. For those unusual cases it would
> + * contain a valid description of the cause for isolation breaking and
> + * target CPU number, just not the correct description of which task
> + * ended up losing isolation.
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
> +{
> +	struct isol_task_desc *desc;
> +	struct task_struct *task;
> +	va_list args;
> +	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
> +	char buf[200];
> +	int curr_cpu, ind_counter, ind_counter_old, ind;
> +
> +	curr_cpu = get_cpu();
> +	desc = &per_cpu(isol_task_descs, cpu);
> +	ind_counter = atomic_read(&desc->curr_index);
> +
> +	if (curr_cpu == cpu) {
> +		/*
> +		 * Message is for the current CPU so current
> +		 * task_struct should be used instead of cached
> +		 * information.
> +		 *
> +		 * Like in other diagnostic messages, if issued from
> +		 * interrupt context, current will be the interrupted
> +		 * task. Unlike other diagnostic messages, this is
> +		 * always relevant because the message is about
> +		 * interrupting a task.
> +		 */
> +		ind = ind_counter & 1;
> +		if (supp && desc->warned[ind]) {
> +			/*
> +			 * If supp is true, skip the message if the
> +			 * same task was mentioned in the message
> +			 * originated on remote CPU, and it did not
> +			 * re-enter isolated state since then (warned
> +			 * is true). Only local messages following
> +			 * remote messages, likely about the same
> +			 * isolation breaking event, are skipped to
> +			 * avoid duplication. If remote cause is
> +			 * immediately followed by a local one before
> +			 * isolation is broken, local cause is skipped
> +			 * from messages.
> +			 */
> +			put_cpu();
> +			return 0;
> +		}
> +		task = current;
> +		snprintf(buf_prefix, sizeof(buf_prefix),
> +			 "isolation %s/%d/%d (cpu %d)",
> +			 task->comm, task->tgid, task->pid, cpu);
> +		put_cpu();
> +	} else {
> +		/*
> +		 * Message is for remote CPU, use cached information.
> +		 */
> +		put_cpu();
> +		/*
> +		 * Make sure the index remained unchanged while the data was
> +		 * copied. If it changed, data that was copied may be
> +		 * inconsistent because two updates in a sequence could
> +		 * overwrite the data while it was being read.
> +		 */
> +		do {
> +			/* Make sure we are reading up to date values */
> +			smp_mb();
> +			ind = ind_counter & 1;
> +			snprintf(buf_prefix, sizeof(buf_prefix),
> +				 "isolation %s/%d/%d (cpu %d)",
> +				 desc->comm[ind], desc->tgid[ind],
> +				 desc->pid[ind], cpu);
> +			desc->warned[ind] = true;
> +			ind_counter_old = ind_counter;
> +			/* Record the warned flag, then re-read descriptor */
> +			smp_mb();
> +			ind_counter = atomic_read(&desc->curr_index);
> +			/*
> +			 * If the counter changed, something was updated, so
> +			 * repeat everything to get the current data
> +			 */
> +		} while (ind_counter != ind_counter_old);
> +	}
> +
> +	va_start(args, fmt);
> +	vsnprintf(buf, sizeof(buf), fmt, args);
> +	va_end(args);
> +
> +	switch (level) {
> +	case LOGLEVEL_EMERG:
> +		pr_emerg("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_ALERT:
> +		pr_alert("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_CRIT:
> +		pr_crit("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_ERR:
> +		pr_err("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_WARNING:
> +		pr_warn("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_NOTICE:
> +		pr_notice("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_INFO:
> +		pr_info("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_DEBUG:
> +		pr_debug("%s: %s", buf_prefix, buf);
> +		break;
> +	default:
> +		/* No message without a valid level */
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +/*
> + * Dump stack if need be. This can be helpful even from the final exit
> + * to usermode code since stack traces sometimes carry information about
> + * what put you into the kernel, e.g. an interrupt number encoded in
> + * the initial entry stack frame that is still visible at exit time.
> + */
> +static void debug_dump_stack(void)
> +{
> +	if (task_isolation_debug)
> +		dump_stack();
> +}
> +
> +/*
> + * Set the flags word but don't try to actually start task isolation yet.
> + * We will start it when entering user space in task_isolation_start().
> + */
> +int task_isolation_request(unsigned int flags)
> +{
> +	struct task_struct *task = current;
> +
> +	/*
> +	 * The task isolation flags should always be cleared just by
> +	 * virtue of having entered the kernel.
> +	 */
> +	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
> +	WARN_ON_ONCE(task->task_isolation_flags != 0);
> +	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
> +
> +	task->task_isolation_flags = flags;
> +	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
> +		return 0;
> +
> +	/* We are trying to enable task isolation. */
> +	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
> +
> +	/*
> +	 * Shut down the vmstat worker so we're not interrupted later.
> +	 * We have to try to do this here (with interrupts enabled) since
> +	 * we are canceling delayed work and will call flush_work()
> +	 * (which enables interrupts) and possibly schedule().
> +	 */
> +	quiet_vmstat_sync();
> +
> +	/* We return 0 here but we may change that in task_isolation_start(). */
> +	return 0;
> +}
> +
> +/*
> + * Perform actions that should be done immediately on exit from isolation.
> + */
> +static void fast_task_isolation_cpu_cleanup(void *info)
> +{
> +	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
> +	/* At this point breaking isolation from other CPUs is possible again */
> +
> +	/*
> +	 * This task is no longer isolated (and if by any chance this
> +	 * is the wrong task, it's already not isolated)
> +	 */
> +	current->task_isolation_flags = 0;
> +	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
> +
> +	/* Run the rest of cleanup later */
> +	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +
> +	/* Copy flags with task isolation disabled */
> +	this_cpu_write(tsk_thread_flags_cache,
> +		       READ_ONCE(task_thread_info(current)->flags));
> +}
> +
> +/* Disable task isolation for the specified task. */
> +static void stop_isolation(struct task_struct *p)
> +{
> +	int cpu, this_cpu;
> +	unsigned long flags;
> +
> +	this_cpu = get_cpu();
> +	cpu = task_cpu(p);
> +	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
> +		/* Already exiting isolation */
> +		atomic_dec(&per_cpu(isol_exit_counter, cpu));
> +		put_cpu();
> +		return;
> +	}
> +
> +	if (p == current) {
> +		p->task_isolation_state = STATE_NORMAL;
> +		fast_task_isolation_cpu_cleanup(NULL);
> +		task_isolation_cpu_cleanup();
> +		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> +			/* Is not isolated already */
> +			atomic_inc(&per_cpu(isol_counter, cpu));
> +		}
> +		put_cpu();
> +	} else {
> +		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> +			/* Is not isolated already */
> +			atomic_inc(&per_cpu(isol_counter, cpu));
> +			atomic_dec(&per_cpu(isol_exit_counter, cpu));
> +			put_cpu();
> +			return;
> +		}
> +		/*
> +		 * Schedule "slow" cleanup. This relies on
> +		 * TIF_NOTIFY_RESUME being set
> +		 */
> +		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
> +		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
> +		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> +		/*
> +		 * Setting flags is delegated to the CPU where
> +		 * isolated task is running
> +		 * isol_exit_counter will be decremented from there as well.
> +		 */
> +		per_cpu(isol_break_csd, cpu).func =
> +		    fast_task_isolation_cpu_cleanup;
> +		per_cpu(isol_break_csd, cpu).info = NULL;
> +		per_cpu(isol_break_csd, cpu).flags = 0;
> +		smp_call_function_single_async(cpu,
> +					       &per_cpu(isol_break_csd, cpu));
> +		put_cpu();
> +	}
> +}
> +
> +/*
> + * This code runs with interrupts disabled just before the return to
> + * userspace, after a prctl() has requested enabling task isolation.
> + * We take whatever steps are needed to avoid being interrupted later:
> + * drain the lru pages, stop the scheduler tick, etc.  More
> + * functionality may be added here later to avoid other types of
> + * interrupts from other kernel subsystems.
> + *
> + * If we can't enable task isolation, we update the syscall return
> + * value with an appropriate error.
> + */
> +void task_isolation_start(void)
> +{
> +	int error;
> +
> +	/*
> +	 * We should only be called in STATE_NORMAL (isolation disabled),
> +	 * on our way out of the kernel from the prctl() that turned it on.
> +	 * If we are exiting from the kernel in another state, it means we
> +	 * made it back into the kernel without disabling task isolation,
> +	 * and we should investigate how (and in any case disable task
> +	 * isolation at this point).  We are clearly not on the path back
> +	 * from the prctl() so we don't touch the syscall return value.
> +	 */
> +	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
> +		/* Increment counter, this will allow isolation breaking */
> +		if (atomic_inc_return(&per_cpu(isol_counter,
> +					      smp_processor_id())) > 1) {
> +			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +		}
> +		atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
> +		stop_isolation(current);
> +		return;
> +	}
> +
> +	/*
> +	 * Must be affinitized to a single core with task isolation possible.
> +	 * In principle this could be remotely modified between the prctl()
> +	 * and the return to userspace, so we have to check it here.
> +	 */
> +	if (current->nr_cpus_allowed != 1 ||
> +	    !is_isolation_cpu(smp_processor_id())) {
> +		error = -EINVAL;
> +		goto error;
> +	}
> +
> +	/* If the vmstat delayed work is not canceled, we have to try again. */
> +	if (!vmstat_idle()) {
> +		error = -EAGAIN;
> +		goto error;
> +	}
> +
> +	/* Try to stop the dynamic tick. */
> +	error = try_stop_full_tick();
> +	if (error)
> +		goto error;
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	/* Increment counter, this will allow isolation breaking */
> +	if (atomic_inc_return(&per_cpu(isol_counter,
> +				      smp_processor_id())) > 1) {
> +		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +	}
> +
> +	/* Record isolated task IDs and name */
> +	record_curr_isolated_task();
> +
> +	/* Copy flags with task isolation enabled */
> +	this_cpu_write(tsk_thread_flags_cache,
> +		       READ_ONCE(task_thread_info(current)->flags));
> +
> +	current->task_isolation_state = STATE_ISOLATED;
> +	return;
> +
> +error:
> +	/* Increment counter, this will allow isolation breaking */
> +	if (atomic_inc_return(&per_cpu(isol_counter,
> +				      smp_processor_id())) > 1) {
> +		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +	}
> +	stop_isolation(current);
> +	syscall_set_return_value(current, current_pt_regs(), error, 0);
> +}
> +
> +/* Stop task isolation on the remote task and send it a signal. */
> +static void send_isolation_signal(struct task_struct *task)
> +{
> +	int flags = task->task_isolation_flags;
> +	kernel_siginfo_t info = {
> +		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
> +	};
> +
> +	stop_isolation(task);
> +	send_sig_info(info.si_signo, &info, task);
> +}
> +
> +/* Only a few syscalls are valid once we are in task isolation mode. */
> +static bool is_acceptable_syscall(int syscall)
> +{
> +	/* No need to incur an isolation signal if we are just exiting. */
> +	if (syscall == __NR_exit || syscall == __NR_exit_group)
> +		return true;
> +
> +	/* Check to see if it's the prctl for isolation. */
> +	if (syscall == __NR_prctl) {
> +		unsigned long arg[SYSCALL_MAX_ARGS];
> +
> +		syscall_get_arguments(current, current_pt_regs(), arg);
> +		if (arg[0] == PR_TASK_ISOLATION)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * This routine is called from syscall entry, prevents most syscalls
> + * from executing, and if needed raises a signal to notify the process.
> + *
> + * Note that we have to stop isolation before we even print a message
> + * here, since otherwise we might end up reporting an interrupt due to
> + * kicking the printk handling code, rather than reporting the true
> + * cause of interrupt here.
> + *
> + * The message is not suppressed by previous remotely triggered
> + * messages.
> + */
> +int task_isolation_syscall(int syscall)
> +{
> +	struct task_struct *task = current;
> +
> +	if (is_acceptable_syscall(syscall)) {
> +		stop_isolation(task);
> +		return 0;
> +	}
> +
> +	send_isolation_signal(task);
> +
> +	pr_task_isol_warn(smp_processor_id(),
> +			  "task_isolation lost due to syscall %d\n",
> +			  syscall);
> +	debug_dump_stack();
> +
> +	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
> +	return -1;
> +}
> +
> +/*
> + * This routine is called from any exception or irq that doesn't
> + * otherwise trigger a signal to the user process (e.g. page fault).
> + *
> + * Messages will be suppressed if there is already a reported remote
> + * cause for isolation breaking, so we don't generate multiple
> + * confusingly similar messages about the same event.
> + */
> +void _task_isolation_interrupt(const char *fmt, ...)
> +{
> +	struct task_struct *task = current;
> +	va_list args;
> +	char buf[100];
> +
> +	/* RCU should have been enabled prior to this point. */
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
> +
> +	/* Are we exiting isolation already? */
> +	if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
> +		task->task_isolation_state = STATE_NORMAL;
> +		return;
> +	}
> +	/*
> +	 * Avoid reporting interrupts that happen after we have prctl'ed
> +	 * to enable isolation, but before we have returned to userspace.
> +	 */
> +	if (task->task_isolation_state == STATE_NORMAL)
> +		return;
> +
> +	va_start(args, fmt);
> +	vsnprintf(buf, sizeof(buf), fmt, args);
> +	va_end(args);
> +
> +	/* Handle NMIs minimally, since we can't send a signal. */
> +	if (in_nmi()) {
> +		pr_task_isol_err(smp_processor_id(),
> +				 "isolation: in NMI; not delivering signal\n");
> +	} else {
> +		send_isolation_signal(task);
> +	}
> +
> +	if (pr_task_isol_warn_supp(smp_processor_id(),
> +				   "task_isolation lost due to %s\n", buf))
> +		debug_dump_stack();
> +}
> +
> +/*
> + * Called before we wake up a task that has a signal to process.
> + * Needs to be done to handle interrupts that trigger signals, which
> + * we don't catch with task_isolation_interrupt() hooks.
> + *
> + * This message is also suppressed if there was already a remotely
> + * caused message about the same isolation breaking event.
> + */
> +void _task_isolation_signal(struct task_struct *task)
> +{
> +	struct isol_task_desc *desc;
> +	int ind, cpu;
> +	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
> +
> +	cpu = task_cpu(task);
> +	desc = &per_cpu(isol_task_descs, cpu);
> +	ind = atomic_read(&desc->curr_index) & 1;
> +	if (desc->warned[ind])
> +		do_warn = false;
> +
> +	stop_isolation(task);
> +
> +	if (do_warn) {
> +		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
> +			task->comm, task->tgid, task->pid, cpu);
> +		debug_dump_stack();
> +	}
> +}
> +
> +/*
> + * Generate a stack backtrace if we are going to interrupt another task
> + * isolation process.
> + */
> +void task_isolation_remote(int cpu, const char *fmt, ...)
> +{
> +	struct task_struct *curr_task;
> +	va_list args;
> +	char buf[200];
> +
> +	if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
> +		return;
> +
> +	curr_task = current;
> +
> +	va_start(args, fmt);
> +	vsnprintf(buf, sizeof(buf), fmt, args);
> +	va_end(args);
> +	if (pr_task_isol_warn(cpu,
> +			      "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> +			      buf,
> +			      curr_task->comm, curr_task->tgid,
> +			      curr_task->pid, smp_processor_id()))
> +		debug_dump_stack();
> +}
> +
> +/*
> + * Generate a stack backtrace if any of the cpus in "mask" are running
> + * task isolation processes.
> + */
> +void task_isolation_remote_cpumask(const struct cpumask *mask,
> +				   const char *fmt, ...)
> +{
> +	struct task_struct *curr_task;
> +	cpumask_var_t warn_mask;
> +	va_list args;
> +	char buf[200];
> +	int cpu, first_cpu;
> +
> +	if (task_isolation_map == NULL ||
> +		!zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
> +		return;
> +
> +	first_cpu = -1;
> +	for_each_cpu_and(cpu, mask, task_isolation_map) {
> +		if (task_isolation_on_cpu(cpu)) {
> +			if (first_cpu < 0)
> +				first_cpu = cpu;
> +			else
> +				cpumask_set_cpu(cpu, warn_mask);
> +		}
> +	}
> +
> +	if (first_cpu < 0)
> +		goto done;
> +
> +	curr_task = current;
> +
> +	va_start(args, fmt);
> +	vsnprintf(buf, sizeof(buf), fmt, args);
> +	va_end(args);
> +
> +	if (cpumask_weight(warn_mask) == 0)
> +		pr_task_isol_warn(first_cpu,
> +				  "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> +				  buf, curr_task->comm, curr_task->tgid,
> +				  curr_task->pid, smp_processor_id());
> +	else
> +		pr_task_isol_warn(first_cpu,
> +				  " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
> +				  cpumask_pr_args(warn_mask),
> +				  buf, curr_task->comm, curr_task->tgid,
> +				  curr_task->pid, smp_processor_id());
> +	debug_dump_stack();
> +
> +done:
> +	free_cpumask_var(warn_mask);
> +}
> +
> +/*
> + * Check if the given CPU is running an isolated task.
> + */
> +int task_isolation_on_cpu(int cpu)
> +{
> +	return test_bit(TIF_TASK_ISOLATION,
> +			&per_cpu(tsk_thread_flags_cache, cpu));
> +}
> +
> +/*
> + * Set CPUs currently running isolated tasks in CPU mask.
> + */
> +void task_isolation_cpumask(struct cpumask *mask)
> +{
> +	int cpu;
> +
> +	if (task_isolation_map == NULL)
> +		return;
> +
> +	for_each_cpu(cpu, task_isolation_map)
> +		if (task_isolation_on_cpu(cpu))
> +			cpumask_set_cpu(cpu, mask);
> +}
> +
> +/*
> + * Clear CPUs currently running isolated tasks in CPU mask.
> + */
> +void task_isolation_clear_cpumask(struct cpumask *mask)
> +{
> +	int cpu;
> +
> +	if (task_isolation_map == NULL)
> +		return;
> +
> +	for_each_cpu(cpu, task_isolation_map)
> +		if (task_isolation_on_cpu(cpu))
> +			cpumask_clear_cpu(cpu, mask);
> +}
> +
> +/*
> + * Cleanup procedure. The call to this procedure may be delayed.
> + */
> +void task_isolation_cpu_cleanup(void)
> +{
> +	kick_hrtimer();
> +}
> +
> +/*
> + * Check if cleanup is scheduled on the current CPU, and if so, run it.
> + * Intended to be called from notify_resume() or another such callback
> + * on the target CPU.
> + */
> +void task_isolation_check_run_cleanup(void)
> +{
> +	int cpu;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
> +
> +	cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
> +		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
> +		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> +		task_isolation_cpu_cleanup();
> +	} else
> +		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> +}
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5b2396350dd1..1df57e38c361 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -46,6 +46,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/cgroup.h>
>  #include <linux/audit.h>
> +#include <linux/isolation.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
>   */
>  void signal_wake_up_state(struct task_struct *t, unsigned int state)
>  {
> +	task_isolation_signal(t);
>  	set_tsk_thread_flag(t, TIF_SIGPENDING);
>  	/*
>  	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> diff --git a/kernel/sys.c b/kernel/sys.c
> index f9bc5c303e3f..0a4059a8c4f9 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
>  #include <linux/syscore_ops.h>
>  #include <linux/version.h>
>  #include <linux/ctype.h>
> +#include <linux/isolation.h>
>  
>  #include <linux/compat.h>
>  #include <linux/syscalls.h>
> @@ -2513,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  
>  		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
>  		break;
> +	case PR_TASK_ISOLATION:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = task_isolation_request(arg2);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3a609e7344f3..5bb98f39bde6 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -30,6 +30,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/interrupt.h>
>  #include <linux/tick.h>
> +#include <linux/isolation.h>
>  #include <linux/err.h>
>  #include <linux/debugobjects.h>
>  #include <linux/sched/signal.h>
> @@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
>  	raw_spin_unlock(&base->lock);
>  }
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +void kick_hrtimer(void)
> +{
> +	unsigned long flags;
> +
> +	preempt_disable();
> +	local_irq_save(flags);
> +	retrigger_next_event(NULL);
> +	local_irq_restore(flags);
> +	preempt_enable();
> +}
> +#endif
> +
>  /*
>   * Switch to high resolution mode
>   */
> @@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
>  void clock_was_set(void)
>  {
>  #ifdef CONFIG_HIGH_RES_TIMERS
> +#ifdef CONFIG_TASK_ISOLATION
> +	struct cpumask mask;
> +
> +	cpumask_clear(&mask);
> +	task_isolation_cpumask(&mask);
> +	cpumask_complement(&mask, &mask);
> +	/*
> +	 * Retrigger the CPU local events everywhere except CPUs
> +	 * running isolated tasks.
> +	 */
> +	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
> +#else
>  	/* Retrigger the CPU local events everywhere */
>  	on_each_cpu(retrigger_next_event, NULL, 1);
> +#endif
>  #endif
>  	timerfd_clock_was_set();
>  }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a792d21cac64..1d4dec9d3ee7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -882,6 +882,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
>  #endif
>  }
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +int try_stop_full_tick(void)
> +{
> +	int cpu = smp_processor_id();
> +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +	/* For an unstable clock, we should return a permanent error code. */
> +	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
> +		return -EINVAL;
> +
> +	if (!can_stop_full_tick(cpu, ts))
> +		return -EAGAIN;
> +
> +	tick_nohz_stop_sched_tick(ts, cpu);
> +	return 0;
> +}
> +#endif
> +
>  static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>  {
>  	/*
> -- 
> 2.20.1
> 

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
                       ` (2 preceding siblings ...)
  2020-04-06  4:31     ` Kevyn-Alexandre Paré
@ 2020-04-06  4:43     ` Kevyn-Alexandre Paré
  3 siblings, 0 replies; 71+ messages in thread
From: Kevyn-Alexandre Paré @ 2020-04-06  4:43 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, catalin.marinas, linux-arm-kernel, netdev,
	davem, linux-arch, will

On Sun, Mar 08, 2020 at 03:47:08AM +0000, Alex Belits wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flags, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
> 
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
> 
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting. Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
> 
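
Side note for readers following the flow described above: a hypothetical
userspace sketch -- illustrative only, not part of the patch, and assuming
CPU 3 has been configured as a nohz_full/isolated core and SIGUSR1 is chosen
as the notification signal -- might look like this:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
#endif

static void lost_isolation(int sig)
{
	(void)sig;	/* log, re-enter isolation, or exit as needed */
}

int main(void)
{
	cpu_set_t set;

	/* Must be affinitized to exactly one task-isolation CPU. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return 1;

	/* Catch the notification signal instead of the default SIGKILL. */
	signal(SIGUSR1, lost_isolation);

	/* Retry until the kernel has quiesced the core (EAGAIN otherwise). */
	while (prctl(PR_TASK_ISOLATION,
		     PR_TASK_ISOLATION_ENABLE |
		     PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0) != 0) {
		if (errno != EAGAIN)
			return 1;	/* e.g. EINVAL: wrong CPU or affinity */
	}

	/* ... userspace-only work, no syscalls, no exceptions ... */

	/* Explicitly leave isolation before using the kernel again. */
	prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
	return 0;
}

The retry loop is just the "userspace can retry as desired" behaviour
mentioned above; nothing here is meant to be definitive about the final API.
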
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
> 
> Separate patches that follow provide these changes for x86, arm,
> and arm64.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   6 +
>  include/linux/hrtimer.h                       |   4 +
>  include/linux/isolation.h                     | 229 ++++++
>  include/linux/sched.h                         |   4 +
>  include/linux/tick.h                          |   3 +
>  include/uapi/linux/prctl.h                    |   6 +
>  init/Kconfig                                  |  28 +
>  kernel/Makefile                               |   2 +
>  kernel/context_tracking.c                     |   2 +
>  kernel/isolation.c                            | 774 ++++++++++++++++++
>  kernel/signal.c                               |   2 +
>  kernel/sys.c                                  |   6 +
>  kernel/time/hrtimer.c                         |  27 +
>  kernel/time/tick-sched.c                      |  18 +
>  14 files changed, 1111 insertions(+)
>  create mode 100644 include/linux/isolation.h
>  create mode 100644 kernel/isolation.c
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c07815d230bc..e4a2d6e37645 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4808,6 +4808,12 @@
>  			neutralize any effect of /proc/sys/kernel/sysrq.
>  			Useful for debugging.
>  
> +	task_isolation_debug	[KNL]
> +			In kernels built with CONFIG_TASK_ISOLATION, this
> +			setting will generate console backtraces to
> +			accompany the diagnostics generated about
> +			interrupting tasks running with task isolation.
> +
>  	tcpmhash_entries= [KNL,NET]
>  			Set the number of tcp_metrics_hash slots.
>  			Default value is 8192 or 16384 depending on total
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 15c8ac313678..e81252eb4f92 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
>  /* Show pending timers: */
>  extern void sysrq_timer_list_show(void);
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +extern void kick_hrtimer(void);
> +#endif
> +
>  int hrtimers_prepare_cpu(unsigned int cpu);
>  #ifdef CONFIG_HOTPLUG_CPU
>  int hrtimers_dead_cpu(unsigned int cpu);
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> new file mode 100644
> index 000000000000..6bd71c67f10f
> --- /dev/null
> +++ b/include/linux/isolation.h
> @@ -0,0 +1,229 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Task isolation support
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@mellanox.com>
> + *   Alex Belits <abelits@marvell.com>
> + *   Yuri Norov <ynorov@marvell.com>
> + */
> +#ifndef _LINUX_ISOLATION_H
> +#define _LINUX_ISOLATION_H
> +
> +#include <stdarg.h>
> +#include <linux/errno.h>
> +#include <linux/cpumask.h>
> +#include <linux/prctl.h>
> +#include <linux/types.h>
> +
> +struct task_struct;
> +
> +#ifdef CONFIG_TASK_ISOLATION
> +
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
> +
> +#define pr_task_isol_emerg(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err(cpu, fmt, ...)				\
> +	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
> +
> +#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_err_supp(cpu, fmt, ...)				\
> +	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_info_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
> +#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
> +	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
> +DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
> +extern cpumask_var_t task_isolation_map;
> +
> +/**
> + * task_isolation_request() - prctl hook to request task isolation
> + * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
> + *
> + * This is called from the generic prctl() code for PR_TASK_ISOLATION.
> + *
> + * Return: 0 when task isolation is enabled, otherwise a negative
> + * errno.
> + */
> +extern int task_isolation_request(unsigned int flags);
> +extern void task_isolation_cpu_cleanup(void);
> +/**
> + * task_isolation_start() - attempt to actually start task isolation
> + *
> + * This function should be invoked as the last thing prior to returning to
> + * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
> + * will attempt to quiesce the core and enter task-isolation mode.  If it
> + * fails, it will reset the system call return value to an error code that
> + * indicates the failure mode.
> + */
> +extern void task_isolation_start(void);
> +
> +/**
> + * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
> + * @cpu:	CPU to check.
> + */
> +static inline bool is_isolation_cpu(int cpu)
> +{
> +	return task_isolation_map != NULL &&
> +		cpumask_test_cpu(cpu, task_isolation_map);
> +}
> +
> +/**
> + * task_isolation_on_cpu() - check if the cpu is running isolated task
> + * @cpu:	CPU to check.
> + */
> +extern int task_isolation_on_cpu(int cpu);
> +extern void task_isolation_check_run_cleanup(void);
> +
> +/**
> + * task_isolation_cpumask() - set CPUs currently running isolated tasks
> + * @mask:	Mask to modify.
> + */
> +extern void task_isolation_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
> + * @mask:      Mask to modify.
> + */
> +extern void task_isolation_clear_cpumask(struct cpumask *mask);
> +
> +/**
> + * task_isolation_syscall() - report a syscall from an isolated task
> + * @nr:		The syscall number.
> + *
> + * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
> + * set in the thread_info flags.  It checks for valid syscalls,
> + * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
> + * For any other syscall it will raise a signal and return failure.
> + *
> + * Return: 0 for acceptable syscalls, -1 for all others.
> + */
> +extern int task_isolation_syscall(int nr);
> +
> +/**
> + * _task_isolation_interrupt() - report an interrupt of an isolated task
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This routine should be invoked at any exception or IRQ if
> + * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
> + * to invoke it if the exception will generate a signal anyway (e.g. a bad
> + * page fault), and in that case it is preferable not to invoke it but just
> + * rely on the standard Linux signal.  The macro task_isolation_interrupt()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_interrupt(const char *fmt, ...);
> +#define task_isolation_interrupt(fmt, ...)				\
> +	do {								\
> +		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
> +			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
> +	} while (0)
> +
> +/**
> + * task_isolation_remote() - report a remote interrupt of an isolated task
> + * @cpu:	The remote cpu that is about to be interrupted.
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This routine should be invoked any time a remote IPI or other type of
> + * interrupt is being delivered to another cpu. The function will check to
> + * see if the target core is running a task-isolation task, and generate a
> + * diagnostic on the console if so; in addition, we tag the task so it
> + * doesn't generate another diagnostic when the interrupt actually arrives.
> + * Generating a diagnostic remotely yields a clearer indication of what
> + * happened than just reporting only when the remote core is interrupted.
> + *
> + */
> +extern void task_isolation_remote(int cpu, const char *fmt, ...);
> +
> +/**
> + * task_isolation_remote_cpumask() - report interruption of multiple cpus
> + * @mask:	The set of remote cpus that are about to be interrupted.
> + * @fmt:	A format string describing the interrupt
> + * @...:	Format arguments, if any.
> + *
> + * This is the cpumask variant of task_isolation_remote().  We
> + * generate a single-line diagnostic message even if multiple remote
> + * task-isolation cpus are being interrupted.
> + */
> +extern void task_isolation_remote_cpumask(const struct cpumask *mask,
> +					  const char *fmt, ...);
> +
> +/**
> + * _task_isolation_signal() - disable task isolation when signal is pending
> + * @task:	The task for which to disable isolation.
> + *
> + * This function generates a diagnostic and disables task isolation; it
> + * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
> + * pending signal.  The task_isolation_interrupt() function normally
> + * generates a diagnostic for events that just interrupt a task without
> + * generating a signal; here we need to hook the paths that correspond to
> + * interrupts that do generate a signal.  The macro task_isolation_signal()
> + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
> + */
> +extern void _task_isolation_signal(struct task_struct *task);
> +#define task_isolation_signal(task)					\
> +	do {								\
> +		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
> +			_task_isolation_signal(task);			\
> +	} while (0)
> +
> +/**
> + * task_isolation_user_exit() - debug all user_exit calls
> + *
> + * By default, we don't generate an exception in the low-level user_exit()
> + * code, because programs lose the ability to disable task isolation: the
> + * user_exit() hook will cause a signal prior to task_isolation_syscall()
> + * disabling task isolation.  In addition, it means that we lose all the
> + * diagnostic info otherwise available from task_isolation_interrupt() hooks
> + * later in the interrupt-handling process.  But you may enable it here for
> + * a special kernel build if you are having undiagnosed userspace jitter.
> + */
> +static inline void task_isolation_user_exit(void)
> +{
> +#ifdef DEBUG_TASK_ISOLATION
> +	task_isolation_interrupt("user_exit");
> +#endif
> +}
> +
> +#else /* !CONFIG_TASK_ISOLATION */
> +static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
> +static inline void task_isolation_start(void) { }
> +static inline bool is_isolation_cpu(int cpu) { return 0; }
> +static inline int task_isolation_on_cpu(int cpu) { return 0; }
> +static inline void task_isolation_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
> +static inline void task_isolation_cpu_cleanup(void) { }
> +static inline void task_isolation_check_run_cleanup(void) { }
> +static inline int task_isolation_syscall(int nr) { return 0; }
> +static inline void task_isolation_interrupt(const char *fmt, ...) { }
> +static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
> +static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
> +						 const char *fmt, ...) { }
> +static inline void task_isolation_signal(struct task_struct *task) { }
> +static inline void task_isolation_user_exit(void) { }
> +#endif
> +
> +#endif /* _LINUX_ISOLATION_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..52fdb32aa3b9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1280,6 +1280,10 @@ struct task_struct {
>  	unsigned long			lowest_stack;
>  	unsigned long			prev_lowest_stack;
>  #endif
> +#ifdef CONFIG_TASK_ISOLATION
> +	unsigned short			task_isolation_flags;  /* prctl */
> +	unsigned short			task_isolation_state;
> +#endif
>  
>  	/*
>  	 * New fields for task_struct should be added above here, so that
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7340613c7eff..27c7c033d5a8 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void __tick_nohz_task_switch(void);
>  extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
> +#ifdef CONFIG_TASK_ISOLATION
> +extern int try_stop_full_tick(void);
> +#endif
>  #else
>  static inline bool tick_nohz_full_enabled(void) { return false; }
>  static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..f4848ed2a069 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,10 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER		57
>  #define PR_GET_IO_FLUSHER		58
>  
> +/* Enable task_isolation mode for TASK_ISOLATION kernels. */
> +#define PR_TASK_ISOLATION		48
> +# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
> +# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 20a6ac33761c..ecdf567f6bd4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -576,6 +576,34 @@ config CPU_ISOLATION
>  
>  source "kernel/rcu/Kconfig"
>  
> +config HAVE_ARCH_TASK_ISOLATION
> +	bool
> +
> +config TASK_ISOLATION
> +	bool "Provide hard CPU isolation from the kernel on demand"
> +	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
> +	help
> +
> +	Allow userspace processes that place themselves on cores with
> +	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
> +	to "isolate" themselves from the kernel.  Prior to returning to
> +	userspace, isolated tasks will arrange that no future kernel
> +	activity will interrupt the task while the task is running in
> +	userspace.  Attempting to re-enter the kernel while in this mode
> +	will cause the task to be terminated with a signal; you must
> +	explicitly use prctl() to disable task isolation before resuming
> +	normal use of the kernel.
> +
> +	This "hard" isolation from the kernel is required for userspace
> +	tasks running hard real-time workloads, such as a high-speed
> +	network driver in userspace.  Without this option, but with
> +	NO_HZ_FULL enabled, the kernel will make a good-faith, "soft"
> +	effort to shield a single userspace process from interrupts, but
> +	makes no guarantees.
> +
> +	You should say "N" unless you are intending to run a
> +	high-performance userspace driver or similar task.
> +
>  config BUILD_BIN2C
>  	bool
>  	default n
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb4130ced32..2f2ae91f90d5 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
>  KASAN_SANITIZE_stackleak.o := n
>  KCOV_INSTRUMENT_stackleak.o := n
>  
> +obj-$(CONFIG_TASK_ISOLATION) += isolation.o
> +
>  $(obj)/configs.o: $(obj)/config_data.gz
>  
>  targets += config_data.gz
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0296b4bda8f1..e9206736f219 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -21,6 +21,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/export.h>
>  #include <linux/kprobes.h>
> +#include <linux/isolation.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
> @@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
>  			if (state == CONTEXT_USER) {
>  				vtime_user_exit(current);
>  				trace_user_exit(0);
> +				task_isolation_user_exit();
>  			}
>  		}
>  		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..ae29732c376c
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,774 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation of task isolation.
> + *
> + * Authors:
> + *   Chris Metcalf <cmetcalf@mellanox.com>
> + *   Alex Belits <abelits@marvell.com>
> + *   Yuri Norov <ynorov@marvell.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/sched.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include <linux/smp.h>
> +#include <linux/tick.h>
> +#include <asm/unistd.h>
> +#include <asm/syscall.h>
> +#include <linux/hrtimer.h>
> +
> +/*
> + * These values are stored in task_isolation_state.
> + * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
> + * returning from sys_prctl() to userspace.
> + */
> +enum {
> +	STATE_NORMAL = 0,	/* Not isolated */
> +	STATE_ISOLATED = 1	/* In userspace, isolated */
> +};
> +
> +/*
> + * This variable contains thread flags copied at the moment
> + * when schedule() switched to the task on a given CPU,
> + * or 0 if no task is running.
> + */
> +DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
> +
> +/*
> + * Counter for isolation state on a given CPU, increments when entering
> + * isolation and decrements when exiting isolation (before or after the
> + * cleanup). Multiple simultaneously running procedures entering or
> + * exiting isolation are prevented by checking the result of
> + * incrementing or decrementing this variable. This variable is both
> + * incremented and decremented by CPU that caused isolation entering or
> + * exit.
> + *
> + * This is necessary because multiple isolation-breaking events may happen
> + * at once (or one as the result of the other), however isolation exit
> + * may only happen once to transition from isolated to non-isolated state.
> + * Therefore, if decrementing this counter results in a value less than 0,
> + * isolation exit procedure can't be started -- it already happened, or is
> + * in progress, or isolation is not entered yet.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_counter);
> +
> +/*
> + * Description of the last two tasks that ran isolated on a given CPU.
> + * This is intended only for messages about isolation breaking. We
> + * don't want any references to actual task while accessing this from
> + * CPU that caused isolation breaking -- we know nothing about timing
> + * and don't want to use locking or RCU.
> + */
> +struct isol_task_desc {
> +	atomic_t curr_index;
> +	atomic_t curr_index_wr;
> +	bool	warned[2];
> +	pid_t	pid[2];
> +	pid_t	tgid[2];
> +	char	comm[2][TASK_COMM_LEN];
> +};
> +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
> +
> +/*
> + * Counter for isolation exiting procedures (from request to the start of
> + * cleanup) being attempted at once on a CPU. Normally incrementing of
> + * this counter is performed from the CPU that caused isolation breaking,
> + * however decrementing is done from the cleanup procedure, delegated to
> + * the CPU that is exiting isolation, not from the CPU that caused isolation
> + * breaking.
> + *
> + * If incrementing this counter while starting isolation exit procedure
> + * results in a value greater than 0, isolation exiting is already in
> + * progress, and cleanup did not start yet. This means, counter should be
> + * decremented back, and isolation exit that is already in progress, should
> + * be allowed to complete. Otherwise, a new isolation exit procedure should
> + * be started.
> + */
> +DEFINE_PER_CPU(atomic_t, isol_exit_counter);
> +
> +/*
> + * Descriptor for isolation-breaking SMP calls
> + */
> +DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
> +
> +cpumask_var_t task_isolation_map;
> +cpumask_var_t task_isolation_cleanup_map;
> +static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
> +
> +/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
> +static int __init task_isolation_init(void)
> +{
> +	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
> +	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
> +		/*
> +		 * At this point task isolation should match
> +		 * nohz_full. This may change in the future.
> +		 */
> +		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
> +	return 0;
> +}
> +core_initcall(task_isolation_init)
> +
> +/* Enable stack backtraces of any interrupts of task_isolation cores. */
> +static bool task_isolation_debug;
> +static int __init task_isolation_debug_func(char *str)
> +{
> +	task_isolation_debug = true;
> +	return 1;
> +}
> +__setup("task_isolation_debug", task_isolation_debug_func);
> +
> +/*
> + * Record name, pid and group pid of the task entering isolation on
> + * the current CPU.
> + */
> +static void record_curr_isolated_task(void)
> +{
> +	int ind;
> +	int cpu = smp_processor_id();
> +	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
> +	struct task_struct *task = current;
> +
> +	/* Finish everything before recording current task */
> +	smp_mb();
> +	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
> +	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
> +	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
> +	desc->pid[ind] = task->pid;
> +	desc->tgid[ind] = task->tgid;
> +	desc->warned[ind] = false;
> +	/* Write everything, to be seen by other CPUs */
> +	smp_mb();
> +	atomic_inc(&desc->curr_index);
> +	/* Everyone will see the new record from this point */
> +	smp_mb();
> +}
> +
> +/*
> + * Print message prefixed with the description of the current (or
> + * last) isolated task on a given CPU. Intended for isolation breaking
> + * messages that include target task for the user's convenience.
> + *
> + * Messages produced with this function may have obsolete task
> + * information if isolated tasks managed to exit, start and enter
> + * isolation multiple times, or multiple tasks tried to enter
> + * isolation on the same CPU at once. For those unusual cases it would
> + * contain a valid description of the cause for isolation breaking and
> + * target CPU number, just not the correct description of which task
> + * ended up losing isolation.
> + */
> +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
> +{
> +	struct isol_task_desc *desc;
> +	struct task_struct *task;
> +	va_list args;
> +	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
> +	char buf[200];
> +	int curr_cpu, ind_counter, ind_counter_old, ind;
> +
> +	curr_cpu = get_cpu();
> +	desc = &per_cpu(isol_task_descs, cpu);
> +	ind_counter = atomic_read(&desc->curr_index);
> +
> +	if (curr_cpu == cpu) {
> +		/*
> +		 * Message is for the current CPU so current
> +		 * task_struct should be used instead of cached
> +		 * information.
> +		 *
> +		 * Like in other diagnostic messages, if issued from
> +		 * interrupt context, current will be the interrupted
> +		 * task. Unlike other diagnostic messages, this is
> +		 * always relevant because the message is about
> +		 * interrupting a task.
> +		 */
> +		ind = ind_counter & 1;
> +		if (supp && desc->warned[ind]) {
> +			/*
> +			 * If supp is true, skip the message if the
> +			 * same task was mentioned in the message
> +			 * originated on remote CPU, and it did not
> +			 * re-enter isolated state since then (warned
> +			 * is true). Only local messages following
> +			 * remote messages, likely about the same
> +			 * isolation breaking event, are skipped to
> +			 * avoid duplication. If remote cause is
> +			 * immediately followed by a local one before
> +			 * isolation is broken, local cause is skipped
> +			 * from messages.
> +			 */
> +			put_cpu();
> +			return 0;
> +		}
> +		task = current;
> +		snprintf(buf_prefix, sizeof(buf_prefix),
> +			 "isolation %s/%d/%d (cpu %d)",
> +			 task->comm, task->tgid, task->pid, cpu);
> +		put_cpu();
> +	} else {
> +		/*
> +		 * Message is for remote CPU, use cached information.
> +		 */
> +		put_cpu();
> +		/*
> +		 * Make sure, index remained unchanged while data was
> +		 * copied. If it changed, data that was copied may be
> +		 * inconsistent because two updates in a sequence could
> +		 * overwrite the data while it was being read.
> +		 */
> +		do {
> +			/* Make sure we are reading up to date values */
> +			smp_mb();
> +			ind = ind_counter & 1;
> +			snprintf(buf_prefix, sizeof(buf_prefix),
> +				 "isolation %s/%d/%d (cpu %d)",
> +				 desc->comm[ind], desc->tgid[ind],
> +				 desc->pid[ind], cpu);
> +			desc->warned[ind] = true;
> +			ind_counter_old = ind_counter;
> +			/* Record the warned flag, then re-read descriptor */
> +			smp_mb();
> +			ind_counter = atomic_read(&desc->curr_index);
> +			/*
> +			 * If the counter changed, something was updated, so
> +			 * repeat everything to get the current data
> +			 */
> +		} while (ind_counter != ind_counter_old);
> +	}
> +
> +	va_start(args, fmt);
> +	vsnprintf(buf, sizeof(buf), fmt, args);
> +	va_end(args);
> +
> +	switch (level) {
> +	case LOGLEVEL_EMERG:
> +		pr_emerg("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_ALERT:
> +		pr_alert("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_CRIT:
> +		pr_crit("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_ERR:
> +		pr_err("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_WARNING:
> +		pr_warn("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_NOTICE:
> +		pr_notice("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_INFO:
> +		pr_info("%s: %s", buf_prefix, buf);
> +		break;
> +	case LOGLEVEL_DEBUG:
> +		pr_debug("%s: %s", buf_prefix, buf);
> +		break;
> +	default:
> +		/* No message without a valid level */
> +		return 0;
> +	}
> +	return 1;
> +}
> +
> +/*
> + * Dump stack if need be. This can be helpful even from the final exit
> + * to usermode code since stack traces sometimes carry information about
> + * what put you into the kernel, e.g. an interrupt number encoded in
> + * the initial entry stack frame that is still visible at exit time.
> + */
> +static void debug_dump_stack(void)
> +{
> +	if (task_isolation_debug)
> +		dump_stack();
> +}
> +
> +/*
> + * Set the flags word but don't try to actually start task isolation yet.
> + * We will start it when entering user space in task_isolation_start().
> + */
> +int task_isolation_request(unsigned int flags)
> +{
> +	struct task_struct *task = current;
> +
> +	/*
> +	 * The task isolation flags should always be cleared just by
> +	 * virtue of having entered the kernel.
> +	 */
> +	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
> +	WARN_ON_ONCE(task->task_isolation_flags != 0);
> +	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
> +
> +	task->task_isolation_flags = flags;
> +	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
> +		return 0;
> +
> +	/* We are trying to enable task isolation. */
> +	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
> +
> +	/*
> +	 * Shut down the vmstat worker so we're not interrupted later.
> +	 * We have to try to do this here (with interrupts enabled) since
> +	 * we are canceling delayed work and will call flush_work()
> +	 * (which enables interrupts) and possibly schedule().
> +	 */
> +	quiet_vmstat_sync();
> +
> +	/* We return 0 here but we may change that in task_isolation_start(). */
> +	return 0;
> +}
> +
> +/*
> + * Perform actions that should be done immediately on exit from isolation.
> + */
> +static void fast_task_isolation_cpu_cleanup(void *info)
> +{
> +	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
> +	/* At this point breaking isolation from other CPUs is possible again */
> +
> +	/*
> +	 * This task is no longer isolated (and if by any chance this
> +	 * is the wrong task, it's already not isolated)
> +	 */
> +	current->task_isolation_flags = 0;
> +	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
> +
> +	/* Run the rest of cleanup later */
> +	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +
> +	/* Copy flags with task isolation disabled */
> +	this_cpu_write(tsk_thread_flags_cache,
> +		       READ_ONCE(task_thread_info(current)->flags));
> +}
> +
> +/* Disable task isolation for the specified task. */
> +static void stop_isolation(struct task_struct *p)
> +{
> +	int cpu, this_cpu;
> +	unsigned long flags;
> +
> +	this_cpu = get_cpu();
> +	cpu = task_cpu(p);
> +	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
> +		/* Already exiting isolation */
> +		atomic_dec(&per_cpu(isol_exit_counter, cpu));
> +		put_cpu();
> +		return;
> +	}
> +
> +	if (p == current) {
> +		p->task_isolation_state = STATE_NORMAL;
> +		fast_task_isolation_cpu_cleanup(NULL);
> +		task_isolation_cpu_cleanup();
> +		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> +			/* Is not isolated already */
> +			atomic_inc(&per_cpu(isol_counter, cpu));
> +		}
> +		put_cpu();
> +	} else {
> +		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
> +			/* Is not isolated already */
> +			atomic_inc(&per_cpu(isol_counter, cpu));
> +			atomic_dec(&per_cpu(isol_exit_counter, cpu));
> +			put_cpu();
> +			return;
> +		}
> +		/*
> +		 * Schedule "slow" cleanup. This relies on
> +		 * TIF_NOTIFY_RESUME being set
> +		 */
> +		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
> +		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
> +		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
> +		/*
> +		 * Setting flags is delegated to the CPU where
> +		 * isolated task is running
> +		 * isol_exit_counter will be decremented from there as well.
> +		 */
> +		per_cpu(isol_break_csd, cpu).func =
> +		    fast_task_isolation_cpu_cleanup;
> +		per_cpu(isol_break_csd, cpu).info = NULL;
> +		per_cpu(isol_break_csd, cpu).flags = 0;
> +		smp_call_function_single_async(cpu,
> +					       &per_cpu(isol_break_csd, cpu));
> +		put_cpu();
> +	}
> +}
> +
> +/*
> + * This code runs with interrupts disabled just before the return to
> + * userspace, after a prctl() has requested enabling task isolation.
> + * We take whatever steps are needed to avoid being interrupted later:
> + * drain the lru pages, stop the scheduler tick, etc.  More
> + * functionality may be added here later to avoid other types of
> + * interrupts from other kernel subsystems.
> + *
> + * If we can't enable task isolation, we update the syscall return
> + * value with an appropriate error.
> + */
> +void task_isolation_start(void)
> +{
> +	int error;
> +
> +	/*
> +	 * We should only be called in STATE_NORMAL (isolation disabled),
> +	 * on our way out of the kernel from the prctl() that turned it on.
> +	 * If we are exiting from the kernel in another state, it means we
> +	 * made it back into the kernel without disabling task isolation,
> +	 * and we should investigate how (and in any case disable task
> +	 * isolation at this point).  We are clearly not on the path back
> +	 * from the prctl() so we don't touch the syscall return value.
> +	 */
> +	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
> +		/* Increment counter, this will allow isolation breaking */
> +		if (atomic_inc_return(&per_cpu(isol_counter,
> +					      smp_processor_id())) > 1) {
> +			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +		}
> +		atomic_inc(&per_cpu(isol_counter, smp_processor_id()));
> +		stop_isolation(current);
> +		return;
> +	}
> +
> +	/*
> +	 * Must be affinitized to a single core with task isolation possible.
> +	 * In principle this could be remotely modified between the prctl()
> +	 * and the return to userspace, so we have to check it here.
> +	 */
> +	if (current->nr_cpus_allowed != 1 ||
> +	    !is_isolation_cpu(smp_processor_id())) {
> +		error = -EINVAL;
> +		goto error;
> +	}
> +
> +	/* If the vmstat delayed work is not canceled, we have to try again. */
> +	if (!vmstat_idle()) {
> +		error = -EAGAIN;
> +		goto error;
> +	}
> +
> +	/* Try to stop the dynamic tick. */
> +	error = try_stop_full_tick();
> +	if (error)
> +		goto error;
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	/* Increment counter, this will allow isolation breaking */
> +	if (atomic_inc_return(&per_cpu(isol_counter,
> +				      smp_processor_id())) > 1) {
> +		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +	}
> +
> +	/* Record isolated task IDs and name */
> +	record_curr_isolated_task();
> +
> +	/* Copy flags with task isolation enabled */
> +	this_cpu_write(tsk_thread_flags_cache,
> +		       READ_ONCE(task_thread_info(current)->flags));
> +
> +	current->task_isolation_state = STATE_ISOLATED;
> +	return;
> +
> +error:
> +	/* Increment counter, this will allow isolation breaking */
> +	if (atomic_inc_return(&per_cpu(isol_counter,
> +				      smp_processor_id())) > 1) {
> +		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
> +	}
> +	stop_isolation(current);
> +	syscall_set_return_value(current, current_pt_regs(), error, 0);
> +}
> +
> +/* Stop task isolation on the remote task and send it a signal. */
> +static void send_isolation_signal(struct task_struct *task)
> +{
> +	int flags = task->task_isolation_flags;
> +	kernel_siginfo_t info = {
> +		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
> +	};
> +
> +	stop_isolation(task);
> +	send_sig_info(info.si_signo, &info, task);
> +}
> +
> +/* Only a few syscalls are valid once we are in task isolation mode. */
> +static bool is_acceptable_syscall(int syscall)
> +{
> +	/* No need to incur an isolation signal if we are just exiting. */
> +	if (syscall == __NR_exit || syscall == __NR_exit_group)
> +		return true;
> +
> +	/* Check to see if it's the prctl for isolation. */
> +	if (syscall == __NR_prctl) {
> +		unsigned long arg[SYSCALL_MAX_ARGS];

gcc output:

kernel/isolation.c: In function 'is_acceptable_syscall':
kernel/isolation.c:511:21: error: 'SYSCALL_MAX_ARGS' undeclared (first use in this function); did you mean 'SYSCALL_ALIAS'?
   unsigned long arg[SYSCALL_MAX_ARGS];
                     ^~~~~~~~~~~~~~~~
                     SYSCALL_ALIAS
kernel/isolation.c:511:21: note: each undeclared identifier is reported only once for each function it appears in
kernel/isolation.c:511:17: warning: unused variable 'arg' [-Wunused-variable]
   unsigned long arg[SYSCALL_MAX_ARGS];
                 ^~~
make[1]: *** [scripts/Makefile.build:267: kernel/isolation.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [Makefile:1683: kernel] Error 2

quick search: 

grep -IHrn SYSCALL_MAX_ARGS
arch/arm/include/asm/syscall.h:54:#define SYSCALL_MAX_ARGS 7
arch/arm64/include/asm/syscall.h:53:#define SYSCALL_MAX_ARGS 6
arch/xtensa/include/asm/syscall.h:57:#define SYSCALL_MAX_ARGS 6
arch/x86/include/asm/syscall.h:91:#define SYSCALL_MAX_ARGS 6
arch/nds32/include/asm/syscall.h:125:#define SYSCALL_MAX_ARGS 6
kernel/isolation.c:511:         unsigned long arg[SYSCALL_MAX_ARGS];

my fix:

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 8db3fdb6102e..b9de03a24e4f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -88,6 +88,8 @@ static inline void syscall_set_return_value(struct task_struct *task,
        regs->ax = (long) error ?: val;
 }
 
+#define SYSCALL_MAX_ARGS 6
+
 #ifdef CONFIG_X86_32
 
 static inline void syscall_get_arguments(struct task_struct *task,
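
Just a thought, untested: since syscall_get_arguments() nowadays
always fills a 6-entry array, a fixed-size array in kernel/isolation.c
would avoid adding the define to every arch header:

-		unsigned long arg[SYSCALL_MAX_ARGS];
+		unsigned long arg[6];	/* syscall_get_arguments() fills 6 */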




thx,

-- Kevyn-Alexandre Paré


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 00/13] "Task_isolation" mode
  2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
                     ` (11 preceding siblings ...)
  2020-03-08  3:57   ` [PATCH v2 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
@ 2020-04-09 15:09   ` Alex Belits
  2020-04-09 15:15     ` [PATCH 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
                       ` (12 more replies)
  12 siblings, 13 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:09 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev


On Sat, 2020-03-07 at 19:42 -0800, Alex Belits wrote:
> This is the updated version of task isolation patchset.
> 
> 1. Commit messages updated to match changes.
> 2. Sign-off lines restored from original patches, changes listed
> wherever applicable.
> 3. arm platform -- added missing calls to syscall check and cleanup
> procedure after leaving isolation.
> 4. x86 platform -- added missing calls to cleanup procedure after
> leaving isolation.
> 

Another update, addressing CPU state / race conditions.

I believe I have a usable solution for the problems of both missed
events and race conditions on isolation entry and exit.

The idea is to make sure that the CPU core remains in userspace and
runs userspace code regardless of what is happening in the kernel and
userspace on the rest of the system; however, any event that results
in running anything other than userspace code causes the CPU core to
re-synchronize with the rest of the system. Then any kernel code,
with the exception of a small and clearly defined set of routines
that only perform kernel entry / exit themselves, runs on the CPU
only after all synchronization is done.

This does require an answer to possible races between isolation
entry / exit (including isolation breaking on interrupts) and updates
that are normally carried by IPIs. So the solution should involve
some mechanism that limits what runs on a CPU in its "stale" state
and forces synchronization before the rest of the kernel is called.
This should also cover preemption -- if preemption happens in that
"stale" state after entering the kernel but before synchronization is
completed, the task should still go through synchronization before
running the rest of the kernel.

Then, as long as it can be demonstrated that the routines running in
the "stale" state are safe to run in it, and that any event that
would normally require an IPI results in entering the rest of the
kernel only after synchronization, the race ceases to be a problem.
Any sequence of events results in exactly the same CPU state when
hitting the rest of the kernel as if the CPU had processed the update
event through an IPI.

I was under the impression that this was already the case; however,
after a closer look it appears that some barriers must be in place to
make sure that this sequence of events is enforced.

There is obviously a question of performance -- we don't want to
cause any additional flushes or add locking to anything
time-critical. Fortunately, entering and exiting isolation (as
opposed to events that _potentially_ call isolation-breaking
routines) is never performance-critical: it is what starts and ends a
task that has no performance-critical communication with the kernel.
So if a CPU that has an isolated task on it is running kernel code,
either the task is not isolated yet (we are exiting to userspace), or
it is no longer running anything performance-critical (intentionally
on exit from isolation, or unintentionally on an isolation-breaking
event).

Isolation state is read-mostly, and we would prefer RCU for it if we
could guarantee that the "stale" state remains safe in all code that
runs until synchronization happens. I am not sure of that, so I tried
to make something more straightforward; however, I might be wrong,
and RCU-ifying exit from isolation may be a better way to do it.

For now I want to make sure that there is a clearly defined, small
amount of kernel code that runs before the inevitable
synchronization, and that this code is unaffected by the "stale"
state.
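
To make the intended ordering explicit, here is an illustrative
sketch (made-up names everywhere except
fast_task_isolation_cpu_cleanup(); the real hooks live in the
per-architecture entry code) of what every kernel entry path should
reduce to:

/* Illustrative pseudo-code only, not a real entry path. */
void hypothetical_kernel_entry(void)
{
	/*
	 * Code running here may still observe the "stale" state, so it
	 * has to be limited to the small set of entry/exit routines
	 * that are known to be safe in that state.
	 */
	if (task_isolation_on_cpu(smp_processor_id()))
		fast_task_isolation_cpu_cleanup(NULL);	/* synchronize */

	/*
	 * From this point on the CPU is synchronized with the rest of
	 * the system, and the rest of the kernel may run normally.
	 */
	rest_of_kernel_entry();		/* made-up name */
}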

I have tried to track down all call paths from kernel entry points
to the call of fast_task_isolation_cpu_cleanup(), and will post those
separately. It's possible that all architecture-specific code already
follows some clearly defined rules about this for other reasons;
however, I am not that familiar with all of it, and tried to check
whether the existing implementation is always safe to run in the
"stale" state before everything that makes task isolation call its
cleanup. For now, this is an implementation that assumes that the
"stale" state is safe for kernel entry.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 01/13] task_isolation: vmstat: add quiet_vmstat_sync function
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
@ 2020-04-09 15:15     ` Alex Belits
  2020-04-09 15:16     ` [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
                       ` (11 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:15 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.
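
For illustration only (the real caller is the task-isolation prctl
path added later in this series), the intended use is:

/* Sketch, not part of this patch. */
static void example_quiesce_vmstat_for_isolation(void)
{
	/*
	 * Must run with interrupts enabled: cancel_delayed_work_sync()
	 * inside quiet_vmstat_sync() may sleep while waiting for a
	 * running vmstat worker to finish.
	 */
	quiet_vmstat_sync();
}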

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..2bc5e85f2514 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -270,6 +270,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -372,6 +373,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..1fa0b2d04afa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1870,6 +1870,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
  2020-04-09 15:15     ` [PATCH 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
@ 2020-04-09 15:16     ` Alex Belits
  2020-04-09 15:17     ` [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier Alex Belits
                       ` (10 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:16 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

This function checks that the vmstat worker is not running and that
the vmstat diffs don't require an update.  The function is called
from the task-isolation code to see if we need to actually do some
work to quiet vmstat.
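
For illustration only (the function name below is made up; the real
caller is the task-isolation code added later in this series), the
expected use is a final check on the way back to userspace:

/* Sketch, not part of this patch. */
static int example_check_vmstat_quiesced(void)
{
	/*
	 * If the vmstat worker is still pending, or the per-cpu diffs
	 * still need folding, isolation cannot start yet; report
	 * -EAGAIN so userspace can retry the prctl().
	 */
	if (!vmstat_idle())
		return -EAGAIN;
	return 0;
}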

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2bc5e85f2514..66d9ae32cf07 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -271,6 +271,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -374,6 +375,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fa0b2d04afa..5c4aec651062 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1879,6 +1879,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
  2020-04-09 15:15     ` [PATCH 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
  2020-04-09 15:16     ` [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
@ 2020-04-09 15:17     ` Alex Belits
  2020-04-15 12:44       ` Mark Rutland
  2020-04-09 15:20     ` [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel Alex Belits
                       ` (9 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:17 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

Some architectures implement memory synchronization instructions for
the instruction cache.  Add a separate kind of barrier, imb(), that
issues them.
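
A hypothetical example of where such a barrier is meant to be used
(names are made up, this is not code from the series): making sure a
CPU does not execute stale instructions after code has been modified
and the required cache maintenance has already been done.

/* Sketch only. */
static void example_run_patched_code(void (*patched_func)(void))
{
	/* Discard stale fetched/decoded instructions on this CPU. */
	imb();
	patched_func();
}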

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/include/asm/barrier.h   | 2 ++
 arch/arm64/include/asm/barrier.h | 2 ++
 include/asm-generic/barrier.h    | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 83ae97c049d9..6def62c95937 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -64,12 +64,14 @@ extern void arm_heavy_mb(void);
 #define mb()		__arm_heavy_mb()
 #define rmb()		dsb()
 #define wmb()		__arm_heavy_mb(st)
+#define imb()		isb()
 #define dma_rmb()	dmb(osh)
 #define dma_wmb()	dmb(oshst)
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
 #define wmb()		barrier()
+#define imb()		barrier()
 #define dma_rmb()	barrier()
 #define dma_wmb()	barrier()
 #endif
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..12a7dbd68bed 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -45,6 +45,8 @@
 #define rmb()		dsb(ld)
 #define wmb()		dsb(st)
 
+#define imb()		isb()
+
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 85b28eb80b11..d5a822fb3e92 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -46,6 +46,10 @@
 #define dma_wmb()	wmb()
 #endif
 
+#ifndef imb
+#define imb()	barrier()
+#endif
+
 #ifndef read_barrier_depends
 #define read_barrier_depends()		do { } while (0)
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (2 preceding siblings ...)
  2020-04-09 15:17     ` [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier Alex Belits
@ 2020-04-09 15:20     ` Alex Belits
  2020-04-09 18:00       ` Andy Lutomirski
  2020-04-09 15:21     ` [PATCH 05/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
                       ` (8 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:20 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flags, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
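
For example (illustrative userspace sketch, not part of the patch;
the helper name and the choice of SIGUSR1 are arbitrary), a task that
has already been affinitized to a single isolated nohz_full core
could use the API like this:

#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION		/* values from this patch */
#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
#endif

static int run_isolated(void (*fast_path)(void))
{
	int flags = PR_TASK_ISOLATION_ENABLE |
		    PR_TASK_ISOLATION_SET_SIG(SIGUSR1);

	/* Retry while the kernel reports it could not quiesce the core. */
	while (prctl(PR_TASK_ISOLATION, flags, 0, 0, 0) != 0) {
		if (errno != EAGAIN)
			return -1;
	}

	fast_path();	/* runs without kernel interruptions */

	/* Explicitly leave isolation before using the kernel again. */
	return prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
}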

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, arm,
and arm64.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 229 +++++
 include/linux/sched.h                         |   4 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  28 +
 kernel/Makefile                               |   2 +
 kernel/context_tracking.c                     |   2 +
 kernel/isolation.c                            | 831 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1168 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c07815d230bc..e4a2d6e37645 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4808,6 +4808,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 15c8ac313678..e81252eb4f92 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -528,6 +528,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..6bd71c67f10f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+DECLARE_PER_CPU(unsigned long, tsk_thread_flags_copy);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+extern int task_isolation_on_cpu(int cpu);
+extern void task_isolation_check_run_cleanup(void);
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_interrupt()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened than just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remote cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline void task_isolation_check_run_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..52fdb32aa3b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,6 +1280,10 @@ struct task_struct {
 	unsigned long			lowest_stack;
 	unsigned long			prev_lowest_stack;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags;  /* prctl */
+	unsigned short			task_isolation_state;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..f4848ed2a069 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 20a6ac33761c..ecdf567f6bd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -576,6 +576,34 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	Allow userspace processes that place themselves on cores with
+	nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	to "isolate" themselves from the kernel.  Prior to returning to
+	userspace, isolated tasks will arrange that no future kernel
+	activity will interrupt the task while the task is running in
+	userspace.  Attempting to re-enter the kernel while in this mode
+	will cause the task to be terminated with a signal; you must
+	explicitly use prctl() to disable task isolation before resuming
+	normal use of the kernel.
+
+	This "hard" isolation from the kernel is required for userspace
+	tasks running hard real-time workloads, such as a high-speed
+	network driver in userspace.  Without this option, but with
+	NO_HZ_FULL enabled, the kernel makes a best-effort, "soft"
+	attempt to shield a single userspace process from interrupts, but
+	makes no guarantees.
+
+	You should say "N" unless you are intending to run a
+	high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 4cb4130ced32..2f2ae91f90d5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -122,6 +122,8 @@ obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_stackleak.o := n
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0296b4bda8f1..e9206736f219 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -157,6 +158,7 @@ void __context_tracking_exit(enum ctx_state state)
 			if (state == CONTEXT_USER) {
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..bec587cee284
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,831 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * This variable contains thread flags copied at the moment
+ * when schedule() switched to the task on a given CPU,
+ * or 0 if no task is running.
+ */
+DEFINE_PER_CPU(unsigned long, tsk_thread_flags_cache);
+
+/*
+ * Counter for isolation state on a given CPU, incremented when entering
+ * isolation and decremented when exiting isolation (before or after the
+ * cleanup). Multiple simultaneously running procedures entering or
+ * exiting isolation are prevented by checking the result of
+ * incrementing or decrementing this variable. This variable is both
+ * incremented and decremented by the CPU that caused the isolation entry
+ * or exit.
+ *
+ * This is necessary because multiple isolation-breaking events may happen
+ * at once (or one as the result of the other), however isolation exit
+ * may only happen once to transition from isolated to non-isolated state.
+ * Therefore, if decrementing this counter results in a value less than 0,
+ * isolation exit procedure can't be started -- it already happened, or is
+ * in progress, or isolation is not entered yet.
+ */
+DEFINE_PER_CPU(atomic_t, isol_counter);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means, counter should be
+ * decremented back, and isolation exit that is already in progress, should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init);
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in a message that
+			 * originated on a remote CPU, and it has not
+			 * re-entered the isolated state since then
+			 * (warned is true). Only local messages that
+			 * follow remote messages, likely about the
+			 * same isolation-breaking event, are skipped
+			 * to avoid duplication. If a remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, the local cause is
+			 * omitted from the messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure the index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran an isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of the kernel
+	 * until other CPUs know that it is no longer isolated.
+	 * While the CPU is running an isolated task, anything that causes
+	 * an interrupt on this CPU must end up calling this function
+	 * before touching the rest of the kernel: either an IPI to this
+	 * function, or stop_isolation() calling it. If any interrupt,
+	 * including the scheduling timer, arrives before a call to this
+	 * function, it will still end up here early after entering the
+	 * kernel. From this point interrupts are disabled until all CPUs
+	 * see that this CPU is no longer running an isolated task.
+	 */
+	local_irq_save(flags);
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	smp_mb__after_atomic();
+	/*
+	 * At this point breaking isolation from other CPUs is possible again,
+	 * however interrupts won't arrive until local_irq_restore()
+	 */
+
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	/* Copy flags with task isolation disabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+	/*
+	 * If something happened that requires a barrier that would
+	 * otherwise be called from remote CPUs by CPU kick procedure,
+	 * this barrier runs instead of it. After this barrier, CPU
+	 * kick procedure would see the updated tsk_thread_flags_cache
+	 * and run its own IPI to trigger a barrier.
+	 */
+	smp_mb();
+	/*
+	 * Synchronize instructions -- this CPU was not kicked while
+	 * in isolated mode, so it might require synchronization.
+	 * There might be an IPI if kick procedure happened and
+	 * tsk_thread_flags_cache was already updated while it
+	 * assembled a CPU mask. However if this did not happen,
+	 * synchronize everything here.
+	 */
+	imb();
+	local_irq_restore(flags);
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Was not isolated; undo the decrement */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+		}
+		put_cpu();
+	} else {
+		if (atomic_dec_return(&per_cpu(isol_counter, cpu)) < 0) {
+			/* Was not isolated; undo the decrement */
+			atomic_inc(&per_cpu(isol_counter, cpu));
+			atomic_dec(&per_cpu(isol_exit_counter, cpu));
+			put_cpu();
+			return;
+		}
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where the
+		 * isolated task is running, and isol_exit_counter
+		 * will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+	unsigned long flags;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		/* Increment counter, this will allow isolation breaking */
+		if (atomic_inc_return(&per_cpu(isol_counter,
+					      smp_processor_id())) > 1) {
+			atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+		}
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/*
+	 * Task is going to be marked as isolated. This disables IPIs
+	 * used for synchronization, so to avoid inconsistency
+	 * don't let anything interrupt us and issue a barrier at the end.
+	 */
+	local_irq_save(flags);
+
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+
+	/* Copy flags with task isolation enabled */
+	this_cpu_write(tsk_thread_flags_cache,
+		       READ_ONCE(task_thread_info(current)->flags));
+	smp_wmb();
+	/* From this point this is recognized as isolated by other CPUs */
+	current->task_isolation_state = STATE_ISOLATED;
+	smp_mb();
+	local_irq_restore(flags);
+	/*
+	 * If anything interrupts us at this point, it will trigger
+	 * isolation breaking procedure.
+	 */
+	return;
+
+error:
+	/* Increment counter, this will allow isolation breaking */
+	if (atomic_inc_return(&per_cpu(isol_counter,
+				      smp_processor_id())) > 1) {
+		atomic_dec(&per_cpu(isol_counter, smp_processor_id()));
+	}
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ *
+ * Messages will be suppressed if there is already a reported remote
+ * cause for isolation breaking, so we don't generate multiple
+ * confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/* Are we exiting isolation already? */
+	if (atomic_read(&per_cpu(isol_exit_counter, smp_processor_id())) != 0) {
+		task->task_isolation_state = STATE_NORMAL;
+		return;
+	}
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		pr_task_isol_err(smp_processor_id(),
+				 "isolation: in NMI; not delivering signal\n");
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (pr_task_isol_warn_supp(smp_processor_id(),
+				   "task_isolation lost due to %s\n", buf))
+		debug_dump_stack();
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void task_isolation_remote(int cpu, const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	va_list args;
+	char buf[200];
+
+	smp_rmb();
+	if (!is_isolation_cpu(cpu) || !task_isolation_on_cpu(cpu))
+		return;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	if (pr_task_isol_warn(cpu,
+			      "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+			      buf,
+			      curr_task->comm, curr_task->tgid,
+			      curr_task->pid, smp_processor_id()))
+		debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void task_isolation_remote_cpumask(const struct cpumask *mask,
+				   const char *fmt, ...)
+{
+	struct task_struct *curr_task;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu, first_cpu;
+
+	if (task_isolation_map == NULL ||
+		!zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	first_cpu = -1;
+	smp_rmb();
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (task_isolation_on_cpu(cpu)) {
+			if (first_cpu < 0)
+				first_cpu = cpu;
+			else
+				cpumask_set_cpu(cpu, warn_mask);
+		}
+	}
+
+	if (first_cpu < 0)
+		goto done;
+
+	curr_task = current;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (cpumask_weight(warn_mask) == 0)
+		pr_task_isol_warn(first_cpu,
+				  "task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	else
+		pr_task_isol_warn(first_cpu,
+				  " and cpus %*pbl: task_isolation lost due to %s by %s/%d/%d on cpu %d\n",
+				  cpumask_pr_args(warn_mask),
+				  buf, curr_task->comm, curr_task->tgid,
+				  curr_task->pid, smp_processor_id());
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
+
+/*
+ * Check if a given CPU is running an isolated task.
+ */
+int task_isolation_on_cpu(int cpu)
+{
+	return test_bit(TIF_TASK_ISOLATION,
+			&per_cpu(tsk_thread_flags_cache, cpu));
+}
+
+/*
+ * Set CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
+
+/*
+ * Check if cleanup is scheduled on the current CPU, and if so, run it.
+ * Intended to be called from notify_resume() or another such callback
+ * on the target CPU.
+ */
+void task_isolation_check_run_cleanup(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 5b2396350dd1..1df57e38c361 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -758,6 +759,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index f9bc5c303e3f..0a4059a8c4f9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2513,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3a609e7344f3..5bb98f39bde6 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -721,6 +722,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -868,8 +882,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a792d21cac64..1d4dec9d3ee7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -882,6 +882,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.20.1
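
For readers following along, here is a minimal sketch (not part of the
patch) of how a userspace task might use the prctl() interface added in
include/uapi/linux/prctl.h above.  The CPU number and the use of SIGUSR1
are illustrative assumptions only, and error handling is reduced to the
essentials:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdio.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_TASK_ISOLATION
	#define PR_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
	#endif

	static volatile sig_atomic_t lost_isolation;

	static void isolation_lost(int sig)
	{
		(void)sig;		/* unused */
		lost_isolation = 1;	/* the kernel broke our isolation */
	}

	int main(void)
	{
		struct sigaction sa = { .sa_handler = isolation_lost };
		cpu_set_t set;

		/* Pin to a single nohz_full/isolcpus core; CPU 2 is illustrative. */
		CPU_ZERO(&set);
		CPU_SET(2, &set);
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}

		/* Ask for SIGUSR1 on violations instead of the default SIGKILL. */
		sigaction(SIGUSR1, &sa, NULL);
		if (prctl(PR_TASK_ISOLATION,
			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
			  0, 0, 0)) {
			perror("prctl(PR_TASK_ISOLATION)");
			return 1;
		}

		/*
		 * Interrupt-free, syscall-free work runs here.  Any kernel
		 * entry other than exit()/exit_group() or another
		 * prctl(PR_TASK_ISOLATION) breaks isolation and delivers
		 * the configured signal.
		 */

		/* Explicitly leave isolation before using the kernel again. */
		prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
		printf("isolation %s\n", lost_isolation ? "was broken" : "exited cleanly");
		return 0;
	}

Without PR_TASK_ISOLATION_SET_SIG(), an isolation violation terminates
the task with SIGKILL, as described in the Kconfig help text above.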


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 05/13] task_isolation: Add task isolation hooks to arch-independent code
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (3 preceding siblings ...)
  2020-04-09 15:20     ` [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-04-09 15:21     ` Alex Belits
  2020-04-09 15:22     ` [PATCH 06/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
                       ` (7 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:21 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
  local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can be
placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now, we
just update either callers or callees as makes most sense.
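
To make that concrete, the intermediate layer could look roughly like the
sketch below.  This is hypothetical and not part of this series;
arch_smp_send_reschedule() does not exist yet and is named here only for
illustration:

	/* Hypothetical sketch of the intermediate layer described above. */
	void smp_send_reschedule(int cpu)
	{
		/* Generic hook: warn if an isolated task is about to be disturbed. */
		task_isolation_remote(cpu, "reschedule IPI");
		arch_smp_send_reschedule(cpu);	/* per-architecture IPI delivery */
	}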

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/irq/irqdesc.c | 9 +++++++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 98a5f10d1900..e2b81d035fa1 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -670,6 +671,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 		irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
@@ -711,6 +716,10 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 
 	irq = irq_find_mapping(domain, hwirq);
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "NMI irq %d (%s)" : "NMI irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * ack_bad_irq is not NMI-safe, just report
 	 * an invalid interrupt.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 828cc30774bc..8fd4ece43dd8 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,6 +18,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -102,8 +103,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+			task_isolation_remote(cpu, "irq_work");
 			arch_send_call_function_single_ipi(cpu);
+		}
 	} else {
 		__irq_work_queue_local(work);
 	}
diff --git a/kernel/smp.c b/kernel/smp.c
index d0ada39eb4d4..3a8bcbdd4ce6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -176,8 +177,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -466,6 +469,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 06/13] task_isolation: arch/x86: enable task isolation functionality
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (4 preceding siblings ...)
  2020-04-09 15:21     ` [PATCH 05/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
@ 2020-04-09 15:22     ` Alex Belits
  2020-04-09 15:23     ` [PATCH v3 07/13] task_isolation: arch/arm64: " Alex Belits
                       ` (6 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:22 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

In prepare_exit_to_usermode(), run cleanup for tasks that have exited
isolation and call task_isolation_start() for tasks that have
TIF_TASK_ISOLATION set.

In syscall_trace_enter(), add the necessary support for reporting
syscalls for task-isolation processes.

Add task_isolation_interrupt() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults.
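
The overall shape of the exit-to-user wiring is the same on every
architecture in this series; a condensed, hypothetical sketch (not the
literal x86 code below) is:

	/* Condensed sketch of the common exit-to-user pattern. */
	static void isolation_exit_to_usermode_hooks(struct thread_info *ti)
	{
		/* First, run any cleanup deferred to the CPU that left isolation. */
		task_isolation_check_run_cleanup();

		/* Then arm isolation for a task that requested it via prctl(). */
		if (READ_ONCE(ti->flags) & _TIF_TASK_ISOLATION)
			task_isolation_start();
	}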

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 15 +++++++++++++++
 arch/x86/include/asm/apic.h        |  3 +++
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/apic/ipi.c         |  2 ++
 arch/x86/mm/fault.c                |  4 ++++
 6 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..9ea6d3e6e77d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,6 +144,7 @@ config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9747876980b5..ba8cd75dc7cf 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -26,6 +26,7 @@
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -86,6 +87,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
 			return -1L;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -189,6 +199,8 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	lockdep_assert_irqs_disabled();
 	lockdep_sys_exit();
 
+	task_isolation_check_run_cleanup();
+
 	cached_flags = READ_ONCE(ti->flags);
 
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
@@ -204,6 +216,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 19e94af9cc5d..71149abbb0a0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_APIC_H
 
 #include <linux/cpumask.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
@@ -524,6 +525,7 @@ extern void irq_exit(void);
 
 static inline void entering_irq(void)
 {
+	task_isolation_interrupt("irq");
 	irq_enter();
 	kvm_set_cpu_l1tf_flush_l1d();
 }
@@ -536,6 +538,7 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
+	task_isolation_interrupt("ack irq");
 	irq_enter();
 	ack_APIC_irq();
 	kvm_set_cpu_l1tf_flush_l1d();
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cf4327986e98..60d107f784ee 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -92,6 +92,7 @@ struct thread_info {
 #define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
+#define TIF_TASK_ISOLATION	18	/* task isolation enabled for task */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_NOCPUID		(1 << TIF_NOCPUID)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
@@ -140,7 +142,7 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 6ca0f91372fd..b4dfaad6a440 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -2,6 +2,7 @@
 
 #include <linux/cpumask.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 
 #include "local.h"
 
@@ -67,6 +68,7 @@ void native_smp_send_reschedule(int cpu)
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
+	task_isolation_remote(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa4ea09593ab..2175a8655a7d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/isolation.h>		/* task_isolation_interrupt     */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1483,6 +1484,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 07/13] task_isolation: arch/arm64: enable task isolation functionality
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (5 preceding siblings ...)
  2020-04-09 15:22     ` [PATCH 06/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
@ 2020-04-09 15:23     ` Alex Belits
  2020-04-22 12:08       ` Catalin Marinas
  2020-04-09 15:24     ` [PATCH v3 08/13] task_isolation: arch/arm: " Alex Belits
                       ` (5 subsequent siblings)
  12 siblings, 1 reply; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:23 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: simplified to match kernel 5.6]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 10 ++++++++++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  7 +++++++
 arch/arm64/mm/fault.c                |  5 +++++
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0b30e884e088..93b6aabc8be9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -129,6 +129,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f0cec4160136..7563098eb5b2 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
@@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index cd6e5fa48b9c..b35b9b0c594c 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return -1;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 339882db5a91..d488c91a4877 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+	_TIF_UPROBE | _TIF_FSCHECK)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_check_run_cleanup();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d4ed9a19d8fe..00f0f77adea0 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -32,6 +32,7 @@
 #include <linux/irq_work.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 #endif
@@ -886,6 +888,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
 	}
 
+	task_isolation_interrupt("IPI type %d (%s)", ipinr,
+				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
 	switch (ipinr) {
 	case IPI_RESCHEDULE:
 		scheduler_ipi();
@@ -948,12 +953,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 85566d32958f..fc4b42c81c4f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -543,6 +544,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
+
 		/*
 		 * Major/minor page fault accounting is only done
 		 * once. If we go through a retry, it is extremely
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 08/13] task_isolation: arch/arm: enable task isolation functionality
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (6 preceding siblings ...)
  2020-04-09 15:23     ` [PATCH v3 07/13] task_isolation: arch/arm64: " Alex Belits
@ 2020-04-09 15:24     ` Alex Belits
  2020-04-09 15:25     ` [PATCH v3 09/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
                       ` (4 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:24 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Francis Giraldeau <francis.giraldeau@gmail.com>

This patch is a port of the task isolation functionality to the arm 32-bit
architecture. Task isolation needs an additional thread flag, which
requires changing the entry assembly code to accept a bitfield larger than
one byte. The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
loaded from the literal pool. The rest of the patch is straightforward and
reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [with modifications]
[abelits@marvell.com: modified for kernel 5.6, added isolation cleanup]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++++++---
 arch/arm/kernel/entry-common.S     | 15 ++++++++++-----
 arch/arm/kernel/ptrace.c           | 10 ++++++++++
 arch/arm/kernel/signal.c           | 13 ++++++++++++-
 arch/arm/kernel/smp.c              |  4 ++++
 arch/arm/mm/fault.c                |  8 +++++++-
 7 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 97864aabc2a6..1a66e6c6807c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -67,6 +67,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 0d0d5178e2c3..ec3c2084c391 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation active */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
@@ -153,18 +154,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 271cb8a1eba1..6ceb5cb808a9 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending
 
 
@@ -90,7 +91,8 @@ __ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
  UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)
@@ -98,7 +100,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif
 
-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
@@ -131,7 +134,8 @@ ENTRY(ret_to_user_from_irq)
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]
-	tst	r1, #_TIF_WORK_MASK
+	ldr	r2, =_TIF_WORK_MASK
+	tst	r1, r2
 	bne	slow_work_pending
 no_work_pending:
 	asm_trace_hardirqs_on save = 0
@@ -251,7 +255,8 @@ local_restart:
 	ldr	r10, [tsk, #TI_FLAGS]		@ check for syscall tracing
 	stmdb	sp!, {r4, r5}			@ push fifth and sixth args
 
-	tst	r10, #_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	ldr	r11, =_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	tst	r10, r11
 	bne	__sys_trace
 
 	invoke_syscall tbl, scno, r10, __ret_fast_syscall
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index b606cded90cd..a69b0bfd71ae 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -24,6 +24,7 @@
 #include <linux/audit.h>
 #include <linux/tracehook.h>
 #include <linux/unistd.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/traps.h>
@@ -921,6 +922,15 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(scno) == -1)
+			return -1;
+	}
+
 	/* Do seccomp after ptrace; syscall may have changed. */
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 	if (secure_computing() == -1)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ab2568996ddb..29ccef8403cd 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -12,6 +12,7 @@
 #include <linux/tracehook.h>
 #include <linux/uprobes.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/elf.h>
 #include <asm/cacheflush.h>
@@ -639,6 +640,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
 	return 0;
 }
 
+#define WORK_PENDING_LOOP_FLAGS	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |	\
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
 asmlinkage int
 do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 {
@@ -648,6 +652,9 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 	 * Update the trace code with the current status.
 	 */
 	trace_hardirqs_off();
+
+	task_isolation_check_run_cleanup();
+
 	do {
 		if (likely(thread_flags & _TIF_NEED_RESCHED)) {
 			schedule();
@@ -676,7 +683,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 		}
 		local_irq_disable();
 		thread_flags = current_thread_info()->flags;
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	return 0;
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 46e1be9e57a8..95f19b980776 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -26,6 +26,7 @@
 #include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <linux/atomic.h>
 #include <asm/bugs.h>
@@ -560,6 +561,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 
@@ -579,6 +581,7 @@ void arch_irq_work_raise(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
@@ -702,6 +705,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bd0f4821f7e1..acd11a69c4e4 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/system_misc.h>
@@ -332,8 +333,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 09/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (7 preceding siblings ...)
  2020-04-09 15:24     ` [PATCH v3 08/13] task_isolation: arch/arm: " Alex Belits
@ 2020-04-09 15:25     ` Alex Belits
  2020-04-09 15:26     ` [PATCH v3 10/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
                       ` (3 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:25 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desired behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard task isolation this
is not desirable because it breaks isolation.

This patch adds a check for it, so that CPUs running isolated tasks are
not kicked.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1d4dec9d3ee7..8488c2825a45 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -262,7 +263,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	smp_rmb();
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 10/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (8 preceding siblings ...)
  2020-04-09 15:25     ` [PATCH v3 09/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
@ 2020-04-09 15:26     ` Alex Belits
  2020-04-09 15:27     ` [PATCH v3 11/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
                       ` (2 subsequent siblings)
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:26 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

If a CPU runs an isolated task, there is no backlog on it, so we
don't need to flush it. Currently flush_all_backlogs() enqueues the
corresponding work on all CPUs, including ones that run isolated
tasks. This breaks task isolation for nothing.

In this patch, backlog flushing is enqueued only on non-isolated CPUs.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_on_cpu() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 net/core/dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c6c985fe7b1b..353d2be39202 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -74,6 +74,7 @@
 #include <linux/cpu.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -5518,9 +5519,13 @@ static void flush_all_backlogs(void)
 
 	get_online_cpus();
 
-	for_each_online_cpu(cpu)
+	smp_rmb();
+	for_each_online_cpu(cpu) {
+		if (task_isolation_on_cpu(cpu))
+			continue;
 		queue_work_on(cpu, system_highpri_wq,
 			      per_cpu_ptr(&flush_works, cpu));
+	}
 
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(&flush_works, cpu));
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 11/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (9 preceding siblings ...)
  2020-04-09 15:26     ` [PATCH v3 10/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
@ 2020-04-09 15:27     ` Alex Belits
  2020-04-09 15:27     ` [PATCH v3 12/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
  2020-04-09 15:28     ` [PATCH v3 13/13] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:27 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs.

A race between this update and isolation breaking is avoided at the
cost of disabling per-cpu buffer writing for the duration of the update
when it coincides with isolation breaking.
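
Spelled out (illustration only; this assumes the isolation-breaking
path publishes its state change with a full barrier before the CPU can
write a trace event, which the rest of the series is expected to
provide), the argument is the usual store-buffering pattern:

resizing CPU                          CPU running isolated task
[ inc record_disabled ]               [ breaks isolation, clears flag ]
[ smp_mb() ]                          [ barrier on kernel entry ]
[ re-check isolation flag ]           [ check record_disabled before write ]

With full barriers on both sides, at least one of the two checks sees
the other side's store: either the resizer sees isolation already
broken and falls back to schedule_work_on(), or the formerly isolated
CPU sees record_disabled != 0 and stays out of the buffer while
rb_update_pages() is performed on its behalf by the resizing CPU.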

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 63 ++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 61f0e92ace99..972f26fc3540 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1701,6 +1702,38 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	smp_rmb();
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure, update is done, and isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -1784,13 +1817,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						&cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+					&cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -1829,13 +1871,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
+			if (!update_if_isolated(cpu_buffer, cpu_id)) {
+				schedule_work_on(cpu_id,
 					 &cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+				wait_for_completion(&cpu_buffer->update_done);
+			}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 12/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (10 preceding siblings ...)
  2020-04-09 15:27     ` [PATCH v3 11/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
@ 2020-04-09 15:27     ` Alex Belits
  2020-04-09 15:28     ` [PATCH v3 13/13] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:27 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not interrupt CPUs that are
running isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 3a8bcbdd4ce6..d9b4b2fedfed 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -731,9 +731,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v3 13/13] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs
  2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
                       ` (11 preceding siblings ...)
  2020-04-09 15:27     ` [PATCH v3 12/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
@ 2020-04-09 15:28     ` Alex Belits
  12 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-09 15:28 UTC (permalink / raw)
  To: frederic, rostedt
  Cc: Prasun Kapoor, mingo, davem, linux-api, peterz, linux-arch,
	catalin.marinas, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

There are various mechanisms that select CPUs for jobs outside of
regular workqueue selection. CPU isolation normally does not
prevent those jobs from running on isolated CPUs. When task
isolation is enabled, those jobs should be limited to housekeeping
CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/pci/pci-driver.c |  9 +++++++
 lib/cpumask.c            | 53 +++++++++++++++++++++++++---------------
 net/core/net-sysfs.c     |  9 +++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 0454ca0e4e3f..cb872cdd1782 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -332,6 +333,9 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+#endif
 	int error, node, cpu;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
@@ -353,7 +357,12 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
+#ifdef CONFIG_TASK_ISOLATION
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
+#else
 		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+#endif
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..dcbc30a47600 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,28 +206,40 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	const struct cpumask *mask;
+	int cpu, m, n;
+
+#ifdef CONFIG_TASK_ISOLATION
+	mask = housekeeping_cpumask(HK_FLAG_DOMAIN);
+	m = cpumask_weight(mask);
+#else
+	mask = cpu_online_mask;
+	m = num_online_cpus();
+#endif
 
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
-
-	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-	} else {
-		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
-			if (i-- == 0)
-				return cpu;
-
-		for_each_cpu(cpu, cpu_online_mask) {
-			/* Skip NUMA nodes, done above. */
-			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-				continue;
-
-			if (i-- == 0)
-				return cpu;
+	n = i % m;
+
+	while (m-- > 0) {
+		if (node == NUMA_NO_NODE) {
+			for_each_cpu(cpu, mask)
+				if (n-- == 0)
+					return cpu;
+		} else {
+			/* NUMA first. */
+			for_each_cpu_and(cpu, cpumask_of_node(node), mask)
+				if (n-- == 0)
+					return cpu;
+
+			for_each_cpu(cpu, mask) {
+				/* Skip NUMA nodes, done above. */
+				if (cpumask_test_cpu(cpu,
+						     cpumask_of_node(node)))
+					continue;
+
+				if (n-- == 0)
+					return cpu;
+			}
 		}
 	}
 	BUG();
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4c826b8bf9b1..253758f102d9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
@@ -725,6 +726,14 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_and(mask, mask, housekeeping_cpumask(HK_FLAG_DOMAIN));
+	if (cpumask_weight(mask) == 0) {
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+#endif
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel
  2020-04-09 15:20     ` [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel Alex Belits
@ 2020-04-09 18:00       ` Andy Lutomirski
  2020-04-19  5:07         ` Alex Belits
  0 siblings, 1 reply; 71+ messages in thread
From: Andy Lutomirski @ 2020-04-09 18:00 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, Prasun Kapoor, mingo, davem, linux-api,
	peterz, linux-arch, catalin.marinas, tglx, will,
	linux-arm-kernel, linux-kernel, netdev



> On Apr 9, 2020, at 8:21 AM, Alex Belits <abelits@marvell.com> wrote:
> 
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "isolcpus=nohz,domain,CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.

I could easily imagine myself using task isolation, but not with the SIGKILL semantics. SIGKILL causes data loss. Please at least let users choose what signal to send.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-09 15:17     ` [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier Alex Belits
@ 2020-04-15 12:44       ` Mark Rutland
  2020-04-19  5:02         ` [EXT] " Alex Belits
  0 siblings, 1 reply; 71+ messages in thread
From: Mark Rutland @ 2020-04-15 12:44 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, Prasun Kapoor, mingo, davem, linux-api,
	peterz, linux-arch, catalin.marinas, tglx, will,
	linux-arm-kernel, linux-kernel, netdev

On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> Some architectures implement memory synchronization instructions for
> instruction cache. Make a separate kind of barrier that calls them.

Modifying the instruction caches requires more than an ISB, and the
'IMB' naming implies you're trying to order against memory accesses,
which isn't what ISB (generally) does.

What exactly do you want to use this for?

As-is, I don't think this makes sense as a generic barrier.

Thanks,
Mark.

> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  arch/arm/include/asm/barrier.h   | 2 ++
>  arch/arm64/include/asm/barrier.h | 2 ++
>  include/asm-generic/barrier.h    | 4 ++++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 83ae97c049d9..6def62c95937 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -64,12 +64,14 @@ extern void arm_heavy_mb(void);
>  #define mb()		__arm_heavy_mb()
>  #define rmb()		dsb()
>  #define wmb()		__arm_heavy_mb(st)
> +#define imb()		isb()
>  #define dma_rmb()	dmb(osh)
>  #define dma_wmb()	dmb(oshst)
>  #else
>  #define mb()		barrier()
>  #define rmb()		barrier()
>  #define wmb()		barrier()
> +#define imb()		barrier()
>  #define dma_rmb()	barrier()
>  #define dma_wmb()	barrier()
>  #endif
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 7d9cc5ec4971..12a7dbd68bed 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -45,6 +45,8 @@
>  #define rmb()		dsb(ld)
>  #define wmb()		dsb(st)
>  
> +#define imb()		isb()
> +
>  #define dma_rmb()	dmb(oshld)
>  #define dma_wmb()	dmb(oshst)
>  
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 85b28eb80b11..d5a822fb3e92 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -46,6 +46,10 @@
>  #define dma_wmb()	wmb()
>  #endif
>  
> +#ifndef imb
> +#define imb()		barrier()
> +#endif
> +
>  #ifndef read_barrier_depends
>  #define read_barrier_depends()		do { } while (0)
>  #endif
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-15 12:44       ` Mark Rutland
@ 2020-04-19  5:02         ` Alex Belits
  2020-04-20 12:23           ` Will Deacon
  2020-04-20 12:45           ` Mark Rutland
  0 siblings, 2 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-19  5:02 UTC (permalink / raw)
  To: mark.rutland
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	Prasun Kapoor, catalin.marinas, frederic, tglx, will,
	linux-kernel, netdev, linux-arm-kernel


On Wed, 2020-04-15 at 13:44 +0100, Mark Rutland wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> > Some architectures implement memory synchronization instructions
> > for
> > instruction cache. Make a separate kind of barrier that calls them.
> 
> Modifying the instruction caches requries more than an ISB, and the
> 'IMB' naming implies you're trying to order against memory accesses,
> which isn't what ISB (generally) does.
> 
> What exactly do you want to use this for?

I guess there should be a different explanation and naming.

The intention is to have a separate barrier that causes a cache
synchronization event, for use in architecture-independent code. I am
not sure what exactly it should do to be implemented in an
architecture-independent manner, so it probably only makes sense along
with a regular memory barrier.

The particular place where I had to use it is the code that has to run
after an isolated task returns to the kernel. In the model that I
propose for task isolation, remote context synchronization is skipped
while the task is isolated in userspace (it doesn't run kernel code,
and the kernel does not modify its userspace code, so it's harmless
until entering the kernel). So it will skip the results of
kick_all_cpus_sync() that was called from flush_icache_range() and
other similar places. This means that once it's out of userspace, it
should only run some "safe" kernel entry code, and then synchronize in
some manner that avoids race conditions with possible IPIs intended for
context synchronization that may happen at the same time. My next patch
in the series uses it in that one place.

Synchronization will have to be implemented without a mandatory
interrupt because it may be triggered locally, on the same CPU. On ARM,
ISB is definitely necessary there, however I am not sure what this
should look like on x86 and other architectures. On ARM this should
probably still be combined with a real memory barrier and cache
synchronization; however, I am not entirely sure about the details.
Would it make more sense to run DMB, IC and ISB?

> 
> As-is, I don't think this makes sense as a generic barrier.
> 
> Thanks,
> Mark.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm/include/asm/barrier.h   | 2 ++
 arch/arm64/include/asm/barrier.h | 2 ++
 include/asm-generic/barrier.h    | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/arch/arm/include/asm/barrier.h
b/arch/arm/include/asm/barrier.h
index 83ae97c049d9..6def62c95937 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -64,12 +64,14 @@ extern void arm_heavy_mb(void);
 #define mb()		__arm_heavy_mb()
 #define rmb()		dsb()
 #define wmb()		__arm_heavy_mb(st)
+#define imb()		isb()
 #define dma_rmb()	dmb(osh)
 #define dma_wmb()	dmb(oshst)
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
 #define wmb()		barrier()
+#define imb()		barrier()
 #define dma_rmb()	barrier()
 #define dma_wmb()	barrier()
 #endif
diff --git a/arch/arm64/include/asm/barrier.h
b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..12a7dbd68bed 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -45,6 +45,8 @@
 #define rmb()		dsb(ld)
 #define wmb()		dsb(st)
 
+#define imb()		isb()
+
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
diff --git a/include/asm-generic/barrier.h b/include/asm-
generic/barrier.h
index 85b28eb80b11..d5a822fb3e92 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -46,6 +46,10 @@
 #define dma_wmb()	wmb()
 #endif
 
+#ifndef imb
+#define imb()		barrier()
+#endif
+
 #ifndef read_barrier_depends
 #define read_barrier_depends()		do { } while (0)
 #endif
-- 
2.20.1





^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel
  2020-04-09 18:00       ` Andy Lutomirski
@ 2020-04-19  5:07         ` Alex Belits
  0 siblings, 0 replies; 71+ messages in thread
From: Alex Belits @ 2020-04-19  5:07 UTC (permalink / raw)
  To: luto
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	Prasun Kapoor, catalin.marinas, frederic, tglx, will,
	linux-kernel, netdev, linux-arm-kernel


On Thu, 2020-04-09 at 11:00 -0700, Andy Lutomirski wrote:
> 
> > 
> > Once the task has returned to userspace after issuing the prctl(),
> > if it enters the kernel again via system call, page fault, or any
> > other exception or irq, the kernel will kill it with SIGKILL.
> 
> I could easily imagine myself using task isolation, but not with the
> SIGKILL semantics. SIGKILL causes data loss. Please at least let
> users choose what signal to send.

This is already done, even though the documentation is not updated.
There is even a userspace library that deals with this while
compensating for possible race conditions on isolation entry and
automatic re-entry after isolation is broken: 
https://github.com/abelits/libtmc
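
(For illustration: a minimal userspace sketch of the enter/retry
pattern described in the cover letter. The PR_TASK_ISOLATION constants
come from this series' uapi headers; the values and the retry limit
below are placeholders, and this is not the libtmc implementation.)

#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION		/* placeholder values, not the real ABI */
#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	1
#endif

static int enter_isolation(void)
{
	int i;

	/*
	 * The kernel may fail to quiesce everything on the first
	 * attempt; per the cover letter, userspace can simply retry.
	 */
	for (i = 0; i < 100; i++) {
		if (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
			  0, 0, 0) == 0)
			return 0;	/* isolated until the next kernel entry */
	}
	return -1;
}

A real application would typically pin itself to an isolated
(nohz_full) CPU and finish all of its setup before calling this.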

-- 
Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-19  5:02         ` [EXT] " Alex Belits
@ 2020-04-20 12:23           ` Will Deacon
  2020-04-20 12:36             ` Mark Rutland
  2020-04-20 12:45           ` Mark Rutland
  1 sibling, 1 reply; 71+ messages in thread
From: Will Deacon @ 2020-04-20 12:23 UTC (permalink / raw)
  To: Alex Belits
  Cc: mark.rutland, mingo, davem, linux-api, rostedt, peterz,
	linux-arch, Prasun Kapoor, catalin.marinas, frederic, tglx,
	linux-kernel, netdev, linux-arm-kernel

On Sun, Apr 19, 2020 at 05:02:01AM +0000, Alex Belits wrote:
> On Wed, 2020-04-15 at 13:44 +0100, Mark Rutland wrote:
> > On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> > > Some architectures implement memory synchronization instructions
> > > for
> > > instruction cache. Make a separate kind of barrier that calls them.
> > 
> > Modifying the instruction caches requries more than an ISB, and the
> > 'IMB' naming implies you're trying to order against memory accesses,
> > which isn't what ISB (generally) does.
> > 
> > What exactly do you want to use this for?
> 
> I guess, there should be different explanation and naming.
> 
> The intention is to have a separate barrier that causes cache
> synchronization event, for use in architecture-independent code. I am
> not sure, what exactly it should do to be implemented in architecture-
> independent manner, so it probably only makes sense along with a
> regular memory barrier.
> 
> The particular place where I had to use is the code that has to run
> after isolated task returns to the kernel. In the model that I propose
> for task isolation, remote context synchronization is skipped while
> task is in isolated in userspace (it doesn't run kernel, and kernel
> does not modify its userspace code, so it's harmless until entering the
> kernel).

> So it will skip the results of kick_all_cpus_sync() that was
> that was called from flush_icache_range() and other similar places.
> This means that once it's out of userspace, it should only run
> some "safe" kernel entry code, and then synchronize in some manner that
> avoids race conditions with possible IPIs intended for context
> synchronization that may happen at the same time. My next patch in the
> series uses it in that one place.
> 
> Synchronization will have to be implemented without a mandatory
> interrupt because it may be triggered locally, on the same CPU. On ARM,
> ISB is definitely necessary there, however I am not sure, how this
> should look like on x86 and other architectures. On ARM this probably
> still should be combined with a real memory barrier and cache
> synchronization, however I am not entirely sure about details. Would
> it make more sense to run DMB, IC and ISB? 

IIUC, we don't need to do anything on arm64 because taking an exception acts
as a context synchronization event, so I don't think you should try to
expose this as a new barrier macro. Instead, just make it a pre-requisite
that architectures need to ensure this behaviour when entering the kernel
from userspace if they are to select HAVE_ARCH_TASK_ISOLATION.

That way, it's /very/ similar to what we do for MEMBARRIER_SYNC_CORE, the
only real difference being that that is concerned with return-to-user rather
than entry-from-user.

See Documentation/features/sched/membarrier-sync-core/arch-support.txt

Will

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-20 12:23           ` Will Deacon
@ 2020-04-20 12:36             ` Mark Rutland
  2020-04-20 13:55               ` Will Deacon
  0 siblings, 1 reply; 71+ messages in thread
From: Mark Rutland @ 2020-04-20 12:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Alex Belits, mingo, davem, linux-api, rostedt, peterz,
	linux-arch, Prasun Kapoor, catalin.marinas, frederic, tglx,
	linux-kernel, netdev, linux-arm-kernel

On Mon, Apr 20, 2020 at 01:23:51PM +0100, Will Deacon wrote:
> On Sun, Apr 19, 2020 at 05:02:01AM +0000, Alex Belits wrote:
> > On Wed, 2020-04-15 at 13:44 +0100, Mark Rutland wrote:
> > > On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> > > > Some architectures implement memory synchronization instructions
> > > > for
> > > > instruction cache. Make a separate kind of barrier that calls them.
> > > 
> > > Modifying the instruction caches requries more than an ISB, and the
> > > 'IMB' naming implies you're trying to order against memory accesses,
> > > which isn't what ISB (generally) does.
> > > 
> > > What exactly do you want to use this for?
> > 
> > I guess, there should be different explanation and naming.
> > 
> > The intention is to have a separate barrier that causes cache
> > synchronization event, for use in architecture-independent code. I am
> > not sure, what exactly it should do to be implemented in architecture-
> > independent manner, so it probably only makes sense along with a
> > regular memory barrier.
> > 
> > The particular place where I had to use is the code that has to run
> > after isolated task returns to the kernel. In the model that I propose
> > for task isolation, remote context synchronization is skipped while
> > task is in isolated in userspace (it doesn't run kernel, and kernel
> > does not modify its userspace code, so it's harmless until entering the
> > kernel).
> 
> > So it will skip the results of kick_all_cpus_sync() that was
> > that was called from flush_icache_range() and other similar places.
> > This means that once it's out of userspace, it should only run
> > some "safe" kernel entry code, and then synchronize in some manner that
> > avoids race conditions with possible IPIs intended for context
> > synchronization that may happen at the same time. My next patch in the
> > series uses it in that one place.
> > 
> > Synchronization will have to be implemented without a mandatory
> > interrupt because it may be triggered locally, on the same CPU. On ARM,
> > ISB is definitely necessary there, however I am not sure, how this
> > should look like on x86 and other architectures. On ARM this probably
> > still should be combined with a real memory barrier and cache
> > synchronization, however I am not entirely sure about details. Would
> > it make more sense to run DMB, IC and ISB? 
> 
> IIUC, we don't need to do anything on arm64 because taking an exception acts
> as a context synchronization event, so I don't think you should try to
> expose this as a new barrier macro. Instead, just make it a pre-requisite
> that architectures need to ensure this behaviour when entering the kernel
> from userspace if they are to select HAVE_ARCH_TASK_ISOLATION.

The CSE from the exception isn't sufficient here, because it needs to
occur after the CPU has re-registered to receive IPIs for
kick_all_cpus_sync(). Otherwise there's a window between taking the
exception and re-registering where a necessary context synchronization
event can be missed. e.g.

CPU A				CPU B
[ Modifies some code ]		
				[ enters exception ]
[ D cache maintenance ]
[ I cache maintenance ]
[ IPI ]				// IPI not taken
  ...				[ register for IPI ] 
[ IPI completes ] 
				[ execute stale code here ]

However, I think 'IMB' is far too generic, and we should have an arch
hook specific to task isolation, as it's far less likely to be abused
than IMB would be.

Thanks,
Mark.
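
(Purely for illustration, such an isolation-specific hook could look
roughly like the following; the task_isolation_sync_core() name and the
arm64 definition are invented here to picture the suggestion and are
not part of the posted series.)

/* Generic fallback: no-op; architectures override as needed. */
#ifndef task_isolation_sync_core
#define task_isolation_sync_core()	do { } while (0)
#endif

/*
 * arm64: run a context synchronization event after the CPU has
 * re-registered for kick_all_cpus_sync() IPIs, so that any code
 * modified while the task was isolated is observed.
 */
#define task_isolation_sync_core()	isb()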

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-19  5:02         ` [EXT] " Alex Belits
  2020-04-20 12:23           ` Will Deacon
@ 2020-04-20 12:45           ` Mark Rutland
  1 sibling, 0 replies; 71+ messages in thread
From: Mark Rutland @ 2020-04-20 12:45 UTC (permalink / raw)
  To: Alex Belits
  Cc: mingo, davem, linux-api, rostedt, peterz, linux-arch,
	Prasun Kapoor, catalin.marinas, frederic, tglx, will,
	linux-kernel, netdev, linux-arm-kernel

On Sun, Apr 19, 2020 at 05:02:01AM +0000, Alex Belits wrote:
> 
> On Wed, 2020-04-15 at 13:44 +0100, Mark Rutland wrote:
> > External Email
> > 
> > -------------------------------------------------------------------
> > ---
> > On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> > > Some architectures implement memory synchronization instructions
> > > for
> > > instruction cache. Make a separate kind of barrier that calls them.
> > 
> > Modifying the instruction caches requries more than an ISB, and the
> > 'IMB' naming implies you're trying to order against memory accesses,
> > which isn't what ISB (generally) does.
> > 
> > What exactly do you want to use this for?
> 
> I guess, there should be different explanation and naming.
> 
> The intention is to have a separate barrier that causes cache
> synchronization event, for use in architecture-independent code. I am
> not sure, what exactly it should do to be implemented in architecture-
> independent manner, so it probably only makes sense along with a
> regular memory barrier.
> 
> The particular place where I had to use is the code that has to run
> after isolated task returns to the kernel. In the model that I propose
> for task isolation, remote context synchronization is skipped while
> task is in isolated in userspace (it doesn't run kernel, and kernel
> does not modify its userspace code, so it's harmless until entering the
> kernel). So it will skip the results of kick_all_cpus_sync() that was
> that was called from flush_icache_range() and other similar places.
> This means that once it's out of userspace, it should only run
> some "safe" kernel entry code, and then synchronize in some manner that
> avoids race conditions with possible IPIs intended for context
> synchronization that may happen at the same time. My next patch in the
> series uses it in that one place.
> 
> Synchronization will have to be implemented without a mandatory
> interrupt because it may be triggered locally, on the same CPU. On ARM,
> ISB is definitely necessary there, however I am not sure, how this
> should look like on x86 and other architectures. On ARM this probably
> still should be combined with a real memory barrier and cache
> synchronization, however I am not entirely sure about details. Would
> it make more sense to run DMB, IC and ISB? 

For the cases you mention above this really depends on how the new CPU
first synchronizes with the others, and what the scope of the "safe"
kernel entry code is.

Given that this is context-dependent, I think it would make more sense
for this to be an arch hook specific to task isolation rather than a
low-level common barrier.

Thanks,
Mark.

> 
> > 
> As-is, I don't think this makes sense as a generic barrier.
> 
> Thanks,
> Mark.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  arch/arm/include/asm/barrier.h   | 2 ++
>  arch/arm64/include/asm/barrier.h | 2 ++
>  include/asm-generic/barrier.h    | 4 ++++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/arch/arm/include/asm/barrier.h
> b/arch/arm/include/asm/barrier.h
> index 83ae97c049d9..6def62c95937 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -64,12 +64,14 @@ extern void arm_heavy_mb(void);
>  #define mb()		__arm_heavy_mb()
>  #define rmb()		dsb()
>  #define wmb()		__arm_heavy_mb(st)
> +#define imb()		isb()
>  #define dma_rmb()	dmb(osh)
>  #define dma_wmb()	dmb(oshst)
>  #else
>  #define mb()		barrier()
>  #define rmb()		barrier()
>  #define wmb()		barrier()
> +#define imb()		barrier()
>  #define dma_rmb()	barrier()
>  #define dma_wmb()	barrier()
>  #endif
> diff --git a/arch/arm64/include/asm/barrier.h
> b/arch/arm64/include/asm/barrier.h
> index 7d9cc5ec4971..12a7dbd68bed 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -45,6 +45,8 @@
>  #define rmb()		dsb(ld)
>  #define wmb()		dsb(st)
>  
> +#define imb()		isb()
> +
>  #define dma_rmb()	dmb(oshld)
>  #define dma_wmb()	dmb(oshst)
>  
> diff --git a/include/asm-generic/barrier.h b/include/asm-
> generic/barrier.h
> index 85b28eb80b11..d5a822fb3e92 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -46,6 +46,10 @@
>  #define dma_wmb()	wmb()
>  #endif
>  
> +#ifndef imb
> +#define imb()		barrier()
> +#endif
> +
>  #ifndef read_barrier_depends
>  #define read_barrier_depends()		do { } while (0)
>  #endif
> -- 
> 2.20.1
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-20 12:36             ` Mark Rutland
@ 2020-04-20 13:55               ` Will Deacon
  2020-04-21  7:41                 ` Will Deacon
  0 siblings, 1 reply; 71+ messages in thread
From: Will Deacon @ 2020-04-20 13:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Alex Belits, mingo, davem, linux-api, rostedt, peterz,
	linux-arch, Prasun Kapoor, catalin.marinas, frederic, tglx,
	linux-kernel, netdev, linux-arm-kernel

On Mon, Apr 20, 2020 at 01:36:28PM +0100, Mark Rutland wrote:
> On Mon, Apr 20, 2020 at 01:23:51PM +0100, Will Deacon wrote:
> > On Sun, Apr 19, 2020 at 05:02:01AM +0000, Alex Belits wrote:
> > > On Wed, 2020-04-15 at 13:44 +0100, Mark Rutland wrote:
> > > > On Thu, Apr 09, 2020 at 03:17:40PM +0000, Alex Belits wrote:
> > > > > Some architectures implement memory synchronization instructions
> > > > > for
> > > > > instruction cache. Make a separate kind of barrier that calls them.
> > > > 
> > > > Modifying the instruction caches requries more than an ISB, and the
> > > > 'IMB' naming implies you're trying to order against memory accesses,
> > > > which isn't what ISB (generally) does.
> > > > 
> > > > What exactly do you want to use this for?
> > > 
> > > I guess, there should be different explanation and naming.
> > > 
> > > The intention is to have a separate barrier that causes cache
> > > synchronization event, for use in architecture-independent code. I am
> > > not sure, what exactly it should do to be implemented in architecture-
> > > independent manner, so it probably only makes sense along with a
> > > regular memory barrier.
> > > 
> > > The particular place where I had to use is the code that has to run
> > > after isolated task returns to the kernel. In the model that I propose
> > > for task isolation, remote context synchronization is skipped while
> > > task is in isolated in userspace (it doesn't run kernel, and kernel
> > > does not modify its userspace code, so it's harmless until entering the
> > > kernel).
> > 
> > > So it will skip the results of kick_all_cpus_sync() that was
> > > that was called from flush_icache_range() and other similar places.
> > > This means that once it's out of userspace, it should only run
> > > some "safe" kernel entry code, and then synchronize in some manner that
> > > avoids race conditions with possible IPIs intended for context
> > > synchronization that may happen at the same time. My next patch in the
> > > series uses it in that one place.
> > > 
> > > Synchronization will have to be implemented without a mandatory
> > > interrupt because it may be triggered locally, on the same CPU. On ARM,
> > > ISB is definitely necessary there, however I am not sure, how this
> > > should look like on x86 and other architectures. On ARM this probably
> > > still should be combined with a real memory barrier and cache
> > > synchronization, however I am not entirely sure about details. Would
> > > it make more sense to run DMB, IC and ISB? 
> > 
> > IIUC, we don't need to do anything on arm64 because taking an exception acts
> > as a context synchronization event, so I don't think you should try to
> > expose this as a new barrier macro. Instead, just make it a pre-requisite
> > that architectures need to ensure this behaviour when entering the kernel
> > from userspace if they are to select HAVE_ARCH_TASK_ISOLATION.
> 
> The CSE from the exception isn't sufficient here, because it needs to
> occur after the CPU has re-registered to receive IPIs for
> kick_all_cpus_sync(). Otherwise there's a window between taking the
> exception and re-registering where a necessary context synchronization
> event can be missed. e.g.
> 
> CPU A				CPU B
> [ Modifies some code ]		
> 				[ enters exception ]
> [ D cache maintenance ]
> [ I cache maintenance ]
> [ IPI ]				// IPI not taken
>   ...				[ register for IPI ] 
> [ IPI completes ] 
> 				[ execute stale code here ]

Thanks.

> However, I think 'IMB' is far too generic, and we should have an arch
> hook specific to task isolation, as it's far less likely to be abused as
> IMB will.

What guarantees we don't run any unsynchronised module code between
exception entry and registering for the IPI? It seems like we'd want that
code to run as early as possible, e.g. as part of
task_isolation_user_exit() but that doesn't seem to be what's happening.

Will

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [EXT] Re: [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier
  2020-04-20 13:55               ` Will Deacon
@ 2020-04-21  7:41                 ` Will Deacon
  0 siblings, 0 replies; 71+ messages in thread
From: Will Deacon @ 2020-04-21  7:41 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Alex Belits, mingo, davem, linux-api, rostedt, peterz,
	linux-arch, Prasun Kapoor, catalin.marinas, frederic, tglx,
	linux-kernel, netdev, linux-arm-kernel

On Mon, Apr 20, 2020 at 02:55:23PM +0100, Will Deacon wrote:
> On Mon, Apr 20, 2020 at 01:36:28PM +0100, Mark Rutland wrote:
> > On Mon, Apr 20, 2020 at 01:23:51PM +0100, Will Deacon wrote:
> > > IIUC, we don't need to do anything on arm64 because taking an exception acts
> > > as a context synchronization event, so I don't think you should try to
> > > expose this as a new barrier macro. Instead, just make it a pre-requisite
> > > that architectures need to ensure this behaviour when entering the kernel
> > > from userspace if they are to select HAVE_ARCH_TASK_ISOLATION.
> > 
> > The CSE from the exception isn't sufficient here, because it needs to
> > occur after the CPU has re-registered to receive IPIs for
> > kick_all_cpus_sync(). Otherwise there's a window between taking the
> > exception and re-registering where a necessary context synchronization
> > event can be missed. e.g.
> > 
> > CPU A				CPU B
> > [ Modifies some code ]		
> > 				[ enters exception ]
> > [ D cache maintenance ]
> > [ I cache maintenance ]
> > [ IPI ]				// IPI not taken
> >   ...				[ register for IPI ] 
> > [ IPI completes ] 
> > 				[ execute stale code here ]
> 
> Thanks.
> 
> > However, I think 'IMB' is far too generic, and we should have an arch
> > hook specific to task isolation, as it's far less likely to be abused as
> > IMB will.
> 
> What guarantees we don't run any unsynchronised module code between
> exception entry and registering for the IPI? It seems like we'd want that
> code to run as early as possible, e.g. as part of
> task_isolation_user_exit() but that doesn't seem to be what's happening.

Sorry, I guess that's more a question for Alex.

Alex -- do you think we could move the "register for IPI" step earlier
so that it's easier to reason about the code that runs in the dead zone
during exception entry?

Will

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 07/13] task_isolation: arch/arm64: enable task isolation functionality
  2020-04-09 15:23     ` [PATCH v3 07/13] task_isolation: arch/arm64: " Alex Belits
@ 2020-04-22 12:08       ` Catalin Marinas
  0 siblings, 0 replies; 71+ messages in thread
From: Catalin Marinas @ 2020-04-22 12:08 UTC (permalink / raw)
  To: Alex Belits
  Cc: frederic, rostedt, Prasun Kapoor, mingo, davem, linux-api,
	peterz, linux-arch, tglx, will, linux-arm-kernel, linux-kernel,
	netdev

On Thu, Apr 09, 2020 at 03:23:35PM +0000, Alex Belits wrote:
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index f0cec4160136..7563098eb5b2 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -63,6 +63,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
>  #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
>  #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
> +#define TIF_TASK_ISOLATION	6
>  #define TIF_NOHZ		7
>  #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
>  #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
> @@ -83,6 +84,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
> +#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
>  #define _TIF_NOHZ		(1 << TIF_NOHZ)
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
> @@ -96,7 +98,8 @@ void arch_release_task_struct(struct task_struct *tsk);
>  
>  #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
>  				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> -				 _TIF_UPROBE | _TIF_FSCHECK)
> +				 _TIF_UPROBE | _TIF_FSCHECK | \
> +				 _TIF_TASK_ISOLATION)
>  
>  #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
>  				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index cd6e5fa48b9c..b35b9b0c594c 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -29,6 +29,7 @@
>  #include <linux/regset.h>
>  #include <linux/tracehook.h>
>  #include <linux/elf.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/compat.h>
>  #include <asm/cpufeature.h>
> @@ -1836,6 +1837,15 @@ int syscall_trace_enter(struct pt_regs *regs)
>  			return -1;
>  	}
>  
> +	/*
> +	 * In task isolation mode, we may prevent the syscall from
> +	 * running, and if so we also deliver a signal to the process.
> +	 */
> +	if (test_thread_flag(TIF_TASK_ISOLATION)) {
> +		if (task_isolation_syscall(regs->syscallno) == -1)
> +			return -1;
> +	}

Is this supposed to be called only when syscall tracing is enabled?
It only gets here if the task has any of the _TIF_SYSCALL_WORK flags.

> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 339882db5a91..d488c91a4877 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -20,6 +20,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/ratelimit.h>
>  #include <linux/syscalls.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/daifflags.h>
>  #include <asm/debug-monitors.h>
> @@ -898,6 +899,11 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> +#define NOTIFY_RESUME_LOOP_FLAGS \
> +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
> +	_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> +	_TIF_UPROBE | _TIF_FSCHECK)

AFAICT, that's just _TIF_WORK_MASK without _TIF_TASK_ISOLATION. I'd
rather not duplicate these, they are prone to get out of sync. You could
do something like:

#define NOTIFY_RESUME_LOOP_FLAGS (_TIF_WORK_MASK & ~_TIF_TASK_ISOLATION)

> +
>  asmlinkage void do_notify_resume(struct pt_regs *regs,
>  				 unsigned long thread_flags)
>  {
> @@ -908,6 +914,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  	 */
>  	trace_hardirqs_off();
>  
> +	task_isolation_check_run_cleanup();
> +
>  	do {
>  		/* Check valid user FS if needed */
>  		addr_limit_user_check();
> @@ -938,7 +946,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  
>  		local_daif_mask();
>  		thread_flags = READ_ONCE(current_thread_info()->flags);
> -	} while (thread_flags & _TIF_WORK_MASK);
> +	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
> +
> +	if (thread_flags & _TIF_TASK_ISOLATION)
> +		task_isolation_start();
>  }
>  
>  unsigned long __ro_after_init signal_minsigstksz;

-- 
Catalin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 03/12] task_isolation: userspace hard isolation from kernel
  2020-03-05 18:33   ` Frederic Weisbecker
  2020-03-08  5:32     ` [EXT] " Alex Belits
@ 2020-04-28 14:12     ` Marcelo Tosatti
  1 sibling, 0 replies; 71+ messages in thread
From: Marcelo Tosatti @ 2020-04-28 14:12 UTC (permalink / raw)
  To: Frederic Weisbecker, Alex Belits
  Cc: Alex Belits, rostedt, mingo, peterz, linux-kernel, Prasun Kapoor,
	tglx, linux-api, linux-mm, linux-arch


I like the idea as well, especially the reporting infrastructure, and 
would like to see something like this integrated upstream.

On Thu, Mar 05, 2020 at 07:33:13PM +0100, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2020 at 04:07:12PM +0000, Alex Belits wrote:
> > The existing nohz_full mode is designed as a "soft" isolation mode
> > that makes tradeoffs to minimize userspace interruptions while
> > still attempting to avoid overheads in the kernel entry/exit path,
> > to provide 100% kernel semantics, etc.
> > 
> > However, some applications require a "hard" commitment from the
> > kernel to avoid interruptions, in particular userspace device driver
> > style applications, such as high-speed networking code.
> > 
> > This change introduces a framework to allow applications
> > to elect to have the "hard" semantics as needed, specifying
> > prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> > 
> > The kernel must be built with the new TASK_ISOLATION Kconfig flag
> > to enable this mode, and the kernel booted with an appropriate
> > "isolcpus=nohz,domain,CPULIST" boot argument to enable
> > nohz_full and isolcpus. The "task_isolation" state is then indicated
> > by setting a new task struct field, task_isolation_flag, to the
> > value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> > bit in the thread_info flags. When the kernel is returning to
> > userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> > it calls the new task_isolation_start() routine to arrange for
> > the task to avoid being interrupted in the future.
> > 
> > With interrupts disabled, task_isolation_start() ensures that kernel
> > subsystems that might cause a future interrupt are quiesced. If it
> > doesn't succeed, it adjusts the syscall return value to indicate that
> > fact, and userspace can retry as desired. In addition to stopping
> > the scheduler tick, the code takes any actions that might avoid
> > a future interrupt to the core, such as a worker thread being
> > scheduled that could be quiesced now (e.g. the vmstat worker)
> > or a future IPI to the core to clean up some state that could be
> > cleaned up now (e.g. the mm lru per-cpu cache).
> > 
> > Once the task has returned to userspace after issuing the prctl(),
> > if it enters the kernel again via system call, page fault, or any
> > other exception or irq, the kernel will kill it with SIGKILL.

This severely limits usage of the interface. 

I suppose the reason for blocking system calls is to make sure
userspace does not initiate actions that might generate interruptions,
such as IPI flushes (from memory unmaps or mapping changes) or vmstat
work items (from page dirtying). Or is there another reason for it?


+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+       /* No need to incur an isolation signal if we are just exiting. */
+       if (syscall == __NR_exit || syscall == __NR_exit_group)
+               return true;
+       
+       /* Check to see if it's the prctl for isolation. */
+       if (syscall == __NR_prctl) {
+               unsigned long arg[SYSCALL_MAX_ARGS];
+       
+               syscall_get_arguments(current, current_pt_regs(), arg);
+               if (arg[0] == PR_TASK_ISOLATION)
+                       return true;
+       }
+ 
+       return false;
+}
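
As a usage illustration of the interface being discussed, a minimal userspace sketch of
the enter-isolation flow the changelog describes. The PR_TASK_ISOLATION constants and
the EAGAIN retry convention are assumptions based on the changelog text, not a confirmed
ABI; per the filter above, only prctl() and exit()/exit_group() remain acceptable once
isolation is entered.

	#include <sys/prctl.h>
	#include <errno.h>

	#ifndef PR_TASK_ISOLATION
	#define PR_TASK_ISOLATION		48	/* assumed value, not final ABI */
	#define PR_TASK_ISOLATION_ENABLE	1
	#endif

	static int enter_isolation(void)
	{
		/* Retry while the kernel still has quiescing work pending. */
		for (;;) {
			if (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
				  0, 0, 0) == 0)
				return 0;	/* isolated; any syscall now breaks it */
			if (errno != EAGAIN)
				return -1;	/* hard failure, give up */
		}
	}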


> > In addition to sending a signal, the code supports a kernel
> > command-line "task_isolation_debug" flag which causes a stack
> > backtrace to be generated whenever a task loses isolation.
> > 
> > To allow the state to be entered and exited, the syscall checking
> > test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> > clear the bit again later, and ignores exit/exit_group to allow
> > exiting the task without a pointless signal being delivered.
> > 
> > The prctl() API allows for specifying a signal number to use instead
> > of the default SIGKILL, to allow for catching the notification
> > signal; for example, in a production environment, it might be
> > helpful to log information to the application logging mechanism
> > before exiting. Or, the signal handler might choose to reset the
> > program counter back to the code segment intended to be run isolated
> > via prctl() to continue execution.
> 
> Hi Alex,
> 
> I'm glad this patchset is being resurrected.
> Reading the changelog, I like the general idea and the direction.
> The diff is a bit scary, but I'll check the patches in detail
> in the upcoming days.
> 
> > 
> > In a number of cases we can tell on a remote cpu that we are
> > going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> > In that case we generate the diagnostic (and optional stack dump)
> > on the remote core to be able to deliver better diagnostics.
> > If the interrupt is not something caught by Linux (e.g. a
> > hypervisor interrupt) we can also request a reschedule IPI to
> > be sent to the remote core so it can be sure to generate a
> > signal to notify the process.
> 
> I'm wondering if it's wise to run that on a guest at all :-)
> Or we would have to consider any guest exit to the host as a
> disturbance, and we would then need some sort of paravirt
> driver to notify that, etc. That doesn't sound appealing.
> 
> Thanks.
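
To make the quiescing step described in the changelog more concrete, a rough sketch of
what task_isolation_start() is expected to do on the return-to-userspace path. The
helpers and ordering are assumptions drawn from the patch titles in this series (e.g.
quiet_vmstat_sync() from patch 01), not taken from the posted patches themselves.

	/* Rough sketch only; runs with interrupts disabled on the exit path. */
	static int task_isolation_start_sketch(void)
	{
		lru_add_drain();		/* flush the mm lru per-cpu cache now   */
		quiet_vmstat_sync();		/* flush vmstat and quiesce its worker  */

		if (!tick_nohz_tick_stopped())	/* the scheduler tick must be off       */
			return -EAGAIN;		/* report failure so userspace retries  */

		return 0;
	}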


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2020-04-28 15:28 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-04 16:01 [PATCH 00/12] "Task_isolation" mode Alex Belits
2020-03-04 16:03 ` [PATCH 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-03-04 16:04 ` [PATCH 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-03-04 16:07 ` [PATCH 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
2020-03-05 18:33   ` Frederic Weisbecker
2020-03-08  5:32     ` [EXT] " Alex Belits
2020-04-28 14:12     ` Marcelo Tosatti
2020-03-06 15:26   ` Frederic Weisbecker
2020-03-08  6:06     ` [EXT] " Alex Belits
2020-03-06 16:00   ` Frederic Weisbecker
2020-03-08  7:16     ` [EXT] " Alex Belits
2020-03-04 16:08 ` [PATCH 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-03-04 16:09 ` [PATCH 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
2020-03-04 16:10 ` [PATCH 06/12] task_isolation: arch/arm64: " Alex Belits
2020-03-04 16:31   ` Mark Rutland
2020-03-08  4:48     ` [EXT] " Alex Belits
2020-03-04 16:11 ` [PATCH 07/12] task_isolation: arch/arm: " Alex Belits
2020-03-04 16:12 ` [PATCH 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-03-06 16:03   ` Frederic Weisbecker
2020-03-08  7:28     ` [EXT] " Alex Belits
2020-03-09  2:38       ` Frederic Weisbecker
2020-03-04 16:13 ` [PATCH 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
2020-03-04 16:14 ` [PATCH 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-03-04 16:15 ` [PATCH 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-03-06 15:34   ` Frederic Weisbecker
2020-03-08  6:48     ` [EXT] " Alex Belits
2020-03-09  2:28       ` Frederic Weisbecker
2020-03-04 16:16 ` [PATCH 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
2020-03-08  3:42 ` [PATCH v2 00/12] "Task_isolation" mode Alex Belits
2020-03-08  3:44   ` [PATCH v2 01/12] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-03-08  3:46   ` [PATCH v2 02/12] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-03-08  3:47   ` [PATCH v2 03/12] task_isolation: userspace hard isolation from kernel Alex Belits
     [not found]     ` <20200307214254.7a8f6c22@hermes.lan>
2020-03-08  7:33       ` [EXT] " Alex Belits
2020-03-27  8:42     ` Marta Rybczynska
2020-04-06  4:31     ` Kevyn-Alexandre Paré
2020-04-06  4:43     ` Kevyn-Alexandre Paré
2020-03-08  3:48   ` [PATCH v2 04/12] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-03-08  3:49   ` [PATCH v2 05/12] task_isolation: arch/x86: enable task isolation functionality Alex Belits
2020-03-08  3:50   ` [PATCH v2 06/12] task_isolation: arch/arm64: " Alex Belits
2020-03-09 16:59     ` Mark Rutland
2020-03-08  3:52   ` [PATCH v2 07/12] task_isolation: arch/arm: " Alex Belits
2020-03-08  3:53   ` [PATCH v2 08/12] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-03-08  3:54   ` [PATCH v2 09/12] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
2020-03-08  3:55   ` [PATCH v2 10/12] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-04-06  4:27     ` Kevyn-Alexandre Paré
2020-03-08  3:56   ` [PATCH v2 11/12] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-03-08  3:57   ` [PATCH v2 12/12] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
2020-04-09 15:09   ` [PATCH v3 00/13] "Task_isolation" mode Alex Belits
2020-04-09 15:15     ` [PATCH 01/13] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-04-09 15:16     ` [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-04-09 15:17     ` [PATCH v3 03/13] task_isolation: add instruction synchronization memory barrier Alex Belits
2020-04-15 12:44       ` Mark Rutland
2020-04-19  5:02         ` [EXT] " Alex Belits
2020-04-20 12:23           ` Will Deacon
2020-04-20 12:36             ` Mark Rutland
2020-04-20 13:55               ` Will Deacon
2020-04-21  7:41                 ` Will Deacon
2020-04-20 12:45           ` Mark Rutland
2020-04-09 15:20     ` [PATCH v3 04/13] task_isolation: userspace hard isolation from kernel Alex Belits
2020-04-09 18:00       ` Andy Lutomirski
2020-04-19  5:07         ` Alex Belits
2020-04-09 15:21     ` [PATCH 05/13] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-04-09 15:22     ` [PATCH 06/13] task_isolation: arch/x86: enable task isolation functionality Alex Belits
2020-04-09 15:23     ` [PATCH v3 07/13] task_isolation: arch/arm64: " Alex Belits
2020-04-22 12:08       ` Catalin Marinas
2020-04-09 15:24     ` [PATCH v3 08/13] task_isolation: arch/arm: " Alex Belits
2020-04-09 15:25     ` [PATCH v3 09/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-04-09 15:26     ` [PATCH v3 10/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks Alex Belits
2020-04-09 15:27     ` [PATCH v3 11/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-04-09 15:27     ` [PATCH v3 12/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-04-09 15:28     ` [PATCH v3 13/13] task_isolation: CONFIG_TASK_ISOLATION prevents distribution of jobs to non-housekeeping CPUs Alex Belits
