* [PATCH v14 00/14] support "task_isolation" mode
@ 2016-08-09 20:29 Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 01/14] vmstat: add quiet_vmstat_sync function Chris Metcalf
                   ` (13 more replies)
  0 siblings, 14 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Francis Giraldeau, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

Here is a respin of the task-isolation patch set.  This primarily
reflects some testing on x86, and a rebase to 4.8.

I have been getting email asking me when and where this patch will be
upstreamed so folks can start using it.  I had been thinking the
obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?

Changes since v13:

- Rebased on v4.8-rc1 (and thus uses the standard try_get_task_struct).

- Fixes a bug when using the clocksource watchdog; it is now scheduled
  to run only on the housekeeping cpus [by Christoph Lameter].

- Fixes a bug in x86 syscall_trace_enter() [seen by Francis Giraldeau].

- Includes a selftest.

The previous (v13) patch series is here:

https://lkml.kernel.org/r/1468529299-27929-1-git-send-email-cmetcalf@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86 (plus some volunteer testing on x86 by
Christoph and Francis).

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal
  task_isolation self test

Christoph Lameter (1):
  clocksource: Do not schedule watchdog on isolated or NOHZ cpus

 Documentation/kernel-parameters.txt                |  16 +
 arch/arm64/Kconfig                                 |   1 +
 arch/arm64/include/asm/thread_info.h               |   5 +-
 arch/arm64/kernel/entry.S                          |  12 +-
 arch/arm64/kernel/ptrace.c                         |  15 +-
 arch/arm64/kernel/signal.c                         |  42 +-
 arch/arm64/kernel/smp.c                            |   2 +
 arch/arm64/mm/fault.c                              |   8 +-
 arch/tile/Kconfig                                  |   1 +
 arch/tile/include/asm/thread_info.h                |   4 +-
 arch/tile/kernel/process.c                         |   9 +
 arch/tile/kernel/ptrace.c                          |   7 +
 arch/tile/kernel/single_step.c                     |   7 +
 arch/tile/kernel/smp.c                             |  26 +-
 arch/tile/kernel/time.c                            |   1 +
 arch/tile/kernel/unaligned.c                       |   4 +
 arch/tile/mm/fault.c                               |  13 +-
 arch/tile/mm/homecache.c                           |   2 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |  20 +-
 arch/x86/include/asm/thread_info.h                 |   2 +
 arch/x86/kernel/smp.c                              |   2 +
 arch/x86/kernel/traps.c                            |   3 +
 arch/x86/mm/fault.c                                |   5 +
 drivers/base/cpu.c                                 |  18 +
 drivers/clocksource/arm_arch_timer.c               |   2 +
 include/linux/context_tracking_state.h             |   6 +
 include/linux/isolation.h                          |  73 +++
 include/linux/sched.h                              |   3 +
 include/linux/swap.h                               |   1 +
 include/linux/tick.h                               |   2 +
 include/linux/vmstat.h                             |   4 +
 include/uapi/linux/prctl.h                         |  10 +
 init/Kconfig                                       |  37 ++
 kernel/Makefile                                    |   1 +
 kernel/fork.c                                      |   3 +
 kernel/irq_work.c                                  |   5 +-
 kernel/isolation.c                                 | 337 +++++++++++
 kernel/sched/core.c                                |  14 +
 kernel/signal.c                                    |  15 +
 kernel/smp.c                                       |   6 +-
 kernel/softirq.c                                   |  33 ++
 kernel/sys.c                                       |   9 +
 kernel/time/clocksource.c                          |  10 +-
 kernel/time/tick-sched.c                           |  36 +-
 mm/swap.c                                          |  15 +-
 mm/vmstat.c                                        |  19 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |  11 +
 tools/testing/selftests/task_isolation/config      |   2 +
 tools/testing/selftests/task_isolation/isolation.c | 618 +++++++++++++++++++++
 51 files changed, 1440 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

-- 
2.7.2


* [PATCH v14 01/14] vmstat: add quiet_vmstat_sync function
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-kernel
  Cc: Chris Metcalf

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core when the function returns.  Add a
quiet_vmstat_sync() function with those semantics.
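
For illustration, the intended contrast between the two interfaces
(a sketch only; the example() wrapper is hypothetical, and the real
implementation is in the diff below):

	#include <linux/vmstat.h>

	static void example(void)
	{
		quiet_vmstat();		/* async: the vmstat work item may
					 * still run on this core later */
		quiet_vmstat_sync();	/* sync: the vmstat worker is
					 * guaranteed not to run on this
					 * core once we return */
	}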

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 613771909b6e..fab62aa74079 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -234,6 +234,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -336,6 +337,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89cec42d19ff..57fc29750da6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1754,6 +1754,15 @@ void quiet_vmstat(void)
 }
 
 /*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v14 02/14] vmstat: add vmstat_idle function
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 01/14] vmstat: add quiet_vmstat_sync function Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running on the
core and that the vmstat diffs do not require an update.  The
function is called from the task-isolation code to determine whether
we actually need to do some work to quiet vmstat.
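
For illustration, a simplified sketch of the intended check-then-quiesce
pattern (the quiesce_vmstat_if_needed() helper is hypothetical; the
real callers are task_isolation_ready() and task_isolation_enter(),
added later in this series):

	#include <linux/vmstat.h>

	static void quiesce_vmstat_if_needed(void)
	{
		/* Worker still queued, or per-cpu diffs needing a fold? */
		if (!vmstat_idle())
			quiet_vmstat_sync();	/* cancel worker, fold diffs */
	}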

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fab62aa74079..69b6cc4be909 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -235,6 +235,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -338,6 +339,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57fc29750da6..7dd17c06d3a7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1763,6 +1763,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 01/14] vmstat: add quiet_vmstat_sync function Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was previously done inline in the loop in
lru_add_drain_all(); factoring it out so it can be called for a
particular cpu is helpful for the task-isolation patches.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..58966a235298 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 75c63bb2a1da..a2be6f0931b5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -655,6 +655,15 @@ void deactivate_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -699,11 +708,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.7.2


* [PATCH v14 04/14] task_isolation: add initial support
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (2 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-11 18:11   ` Frederic Weisbecker
  2016-08-09 20:29 ` [PATCH v14 05/14] task_isolation: track asynchronous interrupts Chris Metcalf
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
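
For example, a minimal userspace sketch of the intended usage
(assuming a kernel booted with something like task_isolation=1; the
PR_* values are the ones added to the uapi header by this patch, and
the choice of cpu 1 is arbitrary):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_TASK_ISOLATION
	# define PR_SET_TASK_ISOLATION		48
	# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#endif

	int main(void)
	{
		cpu_set_t set;

		/* Affinitize to a single task_isolation core first. */
		CPU_ZERO(&set);
		CPU_SET(1, &set);
		if (sched_setaffinity(0, sizeof(set), &set) != 0) {
			perror("sched_setaffinity");
			return 1;
		}

		/* Fails with EINVAL/EAGAIN if preconditions aren't met. */
		if (prctl(PR_SET_TASK_ISOLATION,
			  PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
			perror("prctl");
			return 1;
		}

		/* ... userspace-only work: no syscalls, no page faults ... */

		prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);  /* leave isolation */
		return 0;
	}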

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler tick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and prevents the syscall from running,
setting its return value to -ERESTARTNOINTR before the signal is
delivered.

To allow the state to be entered and exited, the syscall-checking
test ignores the prctl() syscall so that the task can clear the bit
again later, and ignores exit/exit_group so that the task can exit
without a pointless signal being delivered on the way out.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 217 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a49186..7f1336b50dcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y,
+			specify the list of cpus on which tasks will be
+			able to use prctl(PR_SET_TASK_ISOLATION) to enter
+			task isolation mode.  Setting this boot flag
+			implicitly also sets up nohz_full and isolcpus
+			mode for the listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..77dc12cd4fe8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1923,6 +1923,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 69886493ff1e..85a4b6dd26f2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -786,6 +786,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 userspace tasks that are running hard real-time tasks in
+	 userspace, such as a 10 Gbit network driver in userspace.
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d4a866..54542266d7a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1533,6 +1534,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..bf3ebb0a727c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,217 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This
+ * check runs with interrupts disabled to verify that everything we
+ * require is in place before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach here would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and prevents most syscalls from executing and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc00d08..895f547ff66f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 204fdc86863d..a6e29527743e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2


* [PATCH v14 05/14] task_isolation: track asynchronous interrupts
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (3 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 06/14] arch/x86: enable task isolation functionality Chris Metcalf
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump a stack trace to the console.

We also support a new "task_isolation_debug" boot flag which forces
a console stack trace to be dumped out regardless.  We try to catch
the original source of the interrupt, e.g. if an IPI is dispatched
to a task-isolation task, we dump the backtrace of the remote
core that is sending the IPI, rather than just dumping out a
trace showing the core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.
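
For concreteness, a sketch of what that (not yet implemented)
intermediate layer might look like; arch_smp_send_reschedule() is a
hypothetical per-architecture hook:

	/* Generic wrapper so the debug hook runs for every caller,
	 * with the actual IPI dispatch delegated to the architecture.
	 */
	void smp_send_reschedule(int cpu)
	{
		task_isolation_debug(cpu, "reschedule IPI");
		arch_smp_send_reschedule(cpu);
	}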

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt    |  8 ++++
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h              | 13 ++++++
 kernel/irq_work.c                      |  5 ++-
 kernel/isolation.c                     | 74 ++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    | 14 +++++++
 kernel/signal.c                        |  7 ++++
 kernel/smp.c                           |  6 ++-
 kernel/softirq.c                       | 33 +++++++++++++++
 9 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7f1336b50dcc..f172cd310cf4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3951,6 +3951,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+	return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
 			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
 	} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)					\
+	do {								\
+		if (task_isolation_possible(cpu))			\
+			_task_isolation_debug(cpu, type);		\
+	} while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+					 const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+				      const char *type);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -55,6 +66,8 @@ static inline void task_isolation_set_flags(struct task_struct *p,
 					    unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu, const char *type) { }
+#define task_isolation_debug_cpumask(mask, type) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..15f3d44acf11 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index bf3ebb0a727c..7cd57ca95be5 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include <asm/syscall.h>
 #include "time/tick-sched.h"
@@ -215,3 +216,76 @@ int task_isolation_syscall(int syscall)
 					 -ERESTARTNOINTR, -1);
 	return -1;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * Ensure the task is actually in userspace; if it is in kernel
+	 * mode, it is expected that it may receive interrupts, and in
+	 * any case they don't affect the isolation.  Note that there
+	 * is a race condition here as a task may have committed
+	 * to returning to user space but not yet set the context
+	 * tracking state to reflect it, and the check here is before
+	 * we trigger the interrupt, so we might fail to warn about a
+	 * legitimate interrupt.  However, the race window is narrow
+	 * and hitting it does not cause any incorrect behavior other
+	 * than failing to send the warning.
+	 */
+	if (cpu != smp_processor_id() && !context_tracking_cpu_in_user(cpu))
+		return;
+
+	/*
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (in_nmi())
+		force_debug = true;
+	else
+		task_isolation_deliver_signal(p, type);
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("cpu %d: %s violating task isolation for %s/%d on cpu %d\n",
+		       smp_processor_id(), type, p->comm, p->pid, cpu);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask, const char *type)
+{
+	int cpu, thiscpu = get_cpu();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu_and(cpu, mask, task_isolation_map)
+		if (cpu != thiscpu)
+			_task_isolation_debug(cpu, type);
+
+	put_cpu();
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe8e440..21d847bfe5c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -663,6 +664,19 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void _task_isolation_debug(int cpu, const char *type)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task = try_get_task_struct(&rq->curr);
+
+	if (task) {
+		task_isolation_debug_task(cpu, task, type);
+		put_task_struct(task);
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 895f547ff66f..40356a06b761 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -639,6 +639,13 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/*
+	 * We're delivering a signal anyway, so no need for more
+	 * warnings.  This also avoids self-deadlock since an IPI to
+	 * kick the task would otherwise generate another signal.
+	 */
+	task_isolation_set_flags(t, 0);
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 3aa642d39c03..35ca174db581 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -162,8 +163,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -441,6 +444,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..2f1065795318 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id(), "irq");
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.7.2


* [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (4 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 05/14] task_isolation: track asynchronous interrupts Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-10  7:52   ` Andy Lutomirski
  2016-08-09 20:29 ` [PATCH v14 07/14] arm64: factor work_pending state machine to C Chris Metcalf
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In exit_to_usermode_loop(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
and after we've handled the other work, call task_isolation_enter()
for such tasks.

In syscall_trace_enter(), we add the necessary support for
reporting syscalls for task-isolation processes.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Tested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 20 +++++++++++++++++++-
 arch/x86/include/asm/thread_info.h |  2 ++
 arch/x86/kernel/smp.c              |  2 ++
 arch/x86/kernel/traps.c            |  3 +++
 arch/x86/mm/fault.c                |  5 +++++
 6 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5c6e7471b732..10b2c0567dad 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -89,6 +89,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY		if X86_64
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..5c1b9fc89bf2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	if (emulated)
 		return -1L;
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1) {
+			regs->orig_ax = -1;
+			return 0;
+		}
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -170,11 +180,19 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((cached_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			cached_flags &= ~_TIF_TASK_ISOLATION;
+
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 84b59846154a..eb3b648617c1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_TASK_ISOLATION	13	/* task isolation enabled for task */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -117,6 +118,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777cf3851..e4ffd9581cdb 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/isolation.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -125,6 +126,7 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
+	task_isolation_debug(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b70ca12dd389..eae51685c2b3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -383,6 +384,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		/* No signal was generated, but notify task-isolation tasks. */
+		task_isolation_quiet_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dc8023060456..b1509876794c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_quiet_exception */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1397,6 +1398,10 @@ good_area:
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
-- 
2.7.2

* [PATCH v14 07/14] arm64: factor work_pending state machine to C
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (5 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 06/14] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 08/14] arch/arm64: enable task isolation functionality Chris Metcalf
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process
and making it more legible.

This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will almost
certainly enable IRQs) with a call to trace_hardirqs_on just before
the return to userspace.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
(Note: Will Deacon has said he will look at queueing this separately for 4.9)

 arch/arm64/kernel/entry.S  | 12 ++++--------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 96e4a2b64cc1..4af25bb7f116 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -700,18 +700,13 @@ ret_fast_syscall_trace:
  * Ok, we need to do extra processing, enter the slow path.
  */
 work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
 	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bl	trace_hardirqs_off		// the IRQs are off here, inform the tracing code
+	bl	trace_hardirqs_on		// enabled while in userspace
 #endif
-	bl	schedule
-
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
+	b	finish_ret_to_user
 /*
  * "slow" syscall return path.
  */
@@ -720,6 +715,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+finish_ret_to_user:
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8eafdbc7cb8..404dd67080b9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,31 @@ static void do_signal(struct pt_regs *regs)
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+	/*
+	 * The assembly code enters us with IRQs off, but it hasn't
+	 * informed the tracing code of that for efficiency reasons.
+	 * Update the trace code with the current status.
+	 */
+	trace_hardirqs_off();
+	do {
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+		} else {
+			local_irq_enable();
+
+			if (thread_flags & _TIF_SIGPENDING)
+				do_signal(regs);
+
+			if (thread_flags & _TIF_NOTIFY_RESUME) {
+				clear_thread_flag(TIF_NOTIFY_RESUME);
+				tracehook_notify_resume(regs);
+			}
+
+			if (thread_flags & _TIF_FOREIGN_FPSTATE)
+				fpsimd_restore_current_state();
+		}
 
+		local_irq_disable();
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+	} while (thread_flags & _TIF_WORK_MASK);
 }
-- 
2.7.2

* [PATCH v14 08/14] arch/arm64: enable task isolation functionality
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (6 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 07/14] arm64: factor work_pending state machine to C Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 09/14] arch/tile: " Chris Metcalf
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

In do_notify_resume(), call task_isolation_enter() for
TIF_TASK_ISOLATION tasks after we've handled the other work, and
manually clear _TIF_TASK_ISOLATION from the cached thread-info
flags once task_isolation_ready() reports that the core is ready
to quiesce.  To ensure we always call task_isolation_enter() when
returning to userspace, add _TIF_TASK_ISOLATION to _TIF_WORK_MASK.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 15 ++++++++++++---
 arch/arm64/kernel/signal.c           | 10 ++++++++++
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  8 +++++++-
 6 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 69c8787bec7d..1aea9a329b85 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -61,6 +61,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARM_SMCCC
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..bdc6426b9968 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -109,6 +109,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
+#define TIF_TASK_ISOLATION	4
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9
@@ -124,6 +125,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -132,7 +134,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
+				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index e0c81da60f76..9b25ffb436cc 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1347,14 +1348,22 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	/* Do the secure computing after ptrace; failures should be fast. */
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing(NULL) == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 404dd67080b9..f9b9b25636ca 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -424,9 +425,18 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
 				fpsimd_restore_current_state();
+
+			if (thread_flags & _TIF_TASK_ISOLATION)
+				task_isolation_enter();
 		}
 
 		local_irq_disable();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
+
+		/* Clear task isolation from thread_flags manually. */
+		if ((thread_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_flags & _TIF_WORK_MASK);
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 76a6d9263908..ed75244014f5 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -874,6 +875,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_debug(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c8beaa0da7df..f0c4d6ba346e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -382,8 +383,13 @@ retry:
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
-			      VM_FAULT_BADACCESS))))
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_quiet_exception("page fault at %#lx",
+						       addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we have no context to
-- 
2.7.2

* [PATCH v14 09/14] arch/tile: enable task isolation functionality
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (7 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 08/14] arch/arm64: enable task isolation functionality Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 10/14] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_quiet_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/Kconfig                   |  1 +
 arch/tile/include/asm/thread_info.h |  4 +++-
 arch/tile/kernel/process.c          |  9 +++++++++
 arch/tile/kernel/ptrace.c           |  7 +++++++
 arch/tile/kernel/single_step.c      |  7 +++++++
 arch/tile/kernel/smp.c              | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c        |  4 ++++
 arch/tile/mm/fault.c                | 13 ++++++++++++-
 arch/tile/mm/homecache.c            |  2 ++
 9 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4820a02838ac..937cfe4cbb5b 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -18,6 +18,7 @@ config TILE
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..8fe17c7e872e 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_NOHZ		11	/* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION	12	/* in task isolation mode */
 
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
@@ -139,11 +140,12 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION	(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
 	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_TASK_ISOLATION)
 
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index a465d8372edd..bbe1d29b242f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -496,9 +497,17 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		if (thread_info_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
+		/* Clear task isolation from thread_info_flags manually. */
+		if ((thread_info_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_info_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_info_flags & _TIF_WORK_MASK);
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index d89b7011667c..b3f8dffcd9ae 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -261,6 +262,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 		return -1;
 	}
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+			return -1;
+	}
+
 	if (secure_computing(NULL) == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..b48da9860b80 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,9 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +771,9 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..d610322026d0 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(task_isolation_map, "icache flush");
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu, "reschedule IPI");
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 9772a3554282..0335f7cd81f4 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,9 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index beba986589e5..9bd3600092dc 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -308,8 +309,13 @@ static int handle_page_fault(struct pt_regs *regs,
 	 */
 	pgd = get_current_pgd();
 	if (handle_migrating_pte(pgd, fault_num, address, regs->pc,
-				 is_kernel_mode, write))
+				 is_kernel_mode, write)) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (!is_kernel_mode)
+			task_isolation_quiet_exception("migration fault at %#lx",
+						       address);
 		return 1;
+	}
 
 	si_code = SEGV_MAPERR;
 
@@ -479,6 +485,11 @@ good_area:
 #endif
 
 	up_read(&mm->mmap_sem);
+
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	return 1;
 
 /*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..2fe368599df6 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask, "remote cache/TLB flush");
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.7.2

* [PATCH v14 10/14] arm, tile: turn off timer tick for oneshot_stopped state
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (8 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 09/14] arch/tile: " Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 11/14] clocksource: Do not schedule watchdog on isolated or NOHZ cpus Chris Metcalf
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.
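
The generic shape of the fix, sketched here for a hypothetical
driver (the my_timer_* callbacks are illustrative; the tile and
arm_arch_timer hunks below do exactly this, reusing their existing
shutdown handlers):

static struct clock_event_device my_evt = {
	.name			= "my-timer",
	.features		= CLOCK_EVT_FEAT_ONESHOT,
	.set_next_event		= my_timer_set_next_event,
	.set_state_shutdown	= my_timer_shutdown,
	.set_state_oneshot	= my_timer_shutdown,
	/* Without this, the ONESHOT_STOPPED transition returns
	 * -ENOSYS and the hardware timer is left running: */
	.set_state_oneshot_stopped = my_timer_shutdown,
};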

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 28bce3f4f81d..c725156e5a21 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -314,6 +314,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.7.2

* [PATCH v14 11/14] clocksource: Do not schedule watchdog on isolated or NOHZ cpus
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (9 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 10/14] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 12/14] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

From: Christoph Lameter <cl@linux.com>

Watchdog checks can only run on housekeeping-capable cpus; otherwise
we would be generating noise on the isolated processors that we are
trying to avoid.
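
A minimal sketch of the same pattern for any other periodic kernel
work that should stay off the isolated cpus, using the existing
NO_HZ_FULL helpers in <linux/tick.h> (the hunks below use the
housekeeping_cpumask() form directly):

	if (!is_housekeeping_cpu(raw_smp_processor_id()))
		return;		/* skip noisy work on isolated cpus */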

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
  [line-wrapped and added equivalent fix in clocksource_start_watchdog()]
---
 kernel/time/clocksource.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 6a5a310a1a53..b9c79c96d069 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigned long data)
 	 * Cycle through CPUs to check if the CPUs stay synchronized
 	 * to each other.
 	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+	next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
+				    housekeeping_cpumask());
 	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
+		next_cpu = cpumask_first_and(cpu_online_mask,
+					     housekeeping_cpumask());
+
 	watchdog_timer.expires += WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, next_cpu);
 out:
@@ -285,7 +288,8 @@ static inline void clocksource_start_watchdog(void)
 	init_timer(&watchdog_timer);
 	watchdog_timer.function = clocksource_watchdog;
 	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+	add_timer_on(&watchdog_timer,
+		     cpumask_first_and(cpu_online_mask, housekeeping_cpumask()));
 	watchdog_running = 1;
 }
 
-- 
2.7.2

* [PATCH v14 12/14] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (10 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 11/14] clocksource: Do not schedule watchdog on isolated or NOHZ cpus Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 13/14] task_isolation: add user-settable notification signal Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 14/14] task_isolation self test Chris Metcalf
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.
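
For example, on a hypothetical 16-cpu machine, enabling this option
gives roughly the same behavior as booting with:

  task_isolation=1-15

but without having to pass the boot argument; cpu 0 remains a
housekeeping cpu either way.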

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 init/Kconfig       | 10 ++++++++++
 kernel/isolation.c |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 85a4b6dd26f2..2d49c5b78b93 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,16 @@ config TASK_ISOLATION
 	 You should say "N" unless you are intending to run a
 	 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+	bool "Provide task isolation on all CPUs by default (except CPU 0)"
+	depends on TASK_ISOLATION
+	help
+	 If the user doesn't pass the task_isolation boot option to
+	 define the range of task isolation CPUs, treat all CPUs in
+	 the system as task isolation CPUs by default.
+	 Note that the boot CPU is still kept outside the range, to
+	 handle timekeeping duty, etc.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 7cd57ca95be5..f8ccf5e67e38 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
 	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
 	if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+		alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		cpumask_copy(task_isolation_map, cpu_possible_mask);
+		cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
 		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
 		return 0;
+#endif
 	}
 
 	/*
-- 
2.7.2

* [PATCH v14 13/14] task_isolation: add user-settable notification signal
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (11 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 12/14] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  2016-08-09 20:29 ` [PATCH v14 14/14] task_isolation self test Chris Metcalf
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.
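
A userspace sketch of the intended use (the prctl constants are from
this patch; the selftest in the next patch carries fallback
definitions for older headers).  This catches the notification
signal and logs the violation; a handler could equally call prctl()
again to re-enable isolation before returning, as described above:

#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

static void isolation_lost(int sig)
{
	/* write() is async-signal-safe; real logging would go here. */
	(void)sig;
	write(STDERR_FILENO, "task isolation lost\n", 20);
}

static int enable_isolation_sigusr1(void)
{
	int rc;

	signal(SIGUSR1, isolation_lost);
	do {
		rc = prctl(PR_SET_TASK_ISOLATION,
			   PR_TASK_ISOLATION_ENABLE |
			   PR_TASK_ISOLATION_USERSIG |
			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
	} while (rc != 0 && errno == EAGAIN);

	return rc;
}

For the "no signal" mode described above, the application would
instead pass PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG.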

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/uapi/linux/prctl.h |  5 ++++
 kernel/isolation.c         | 62 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index f8ccf5e67e38..d36cb3943c80 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
 	return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+	if (flags & PR_TASK_ISOLATION_USERSIG)
+		return PR_TASK_ISOLATION_GET_SIG(flags);
+	else
+		return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
 	if (flags != 0) {
+		int sig = task_isolation_sig(flags);
+
 		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
 		    !task_isolation_possible(raw_smp_processor_id())) {
 			/* Invalid task affinity setting. */
-			return -EINVAL;
+			if (sig)
+				return -EINVAL;
+			else
+				pr_warn("%s/%d: enabling non-signalling task isolation\n"
+					"while not bound to a single task isolation core\n",
+					current->comm, current->pid);
 		}
-		if (!can_stop_my_full_tick_now()) {
+
+		if (sig && !can_stop_my_full_tick_now()) {
 			/* System not yet ready for task isolation. */
 			return -EAGAIN;
 		}
@@ -160,11 +183,11 @@ void task_isolation_enter(void)
 }
 
 static void task_isolation_deliver_signal(struct task_struct *task,
-					  const char *buf)
+					  const char *buf, int sig)
 {
 	siginfo_t info = {};
 
-	info.si_signo = SIGKILL;
+	info.si_signo = sig;
 
 	/*
 	 * Report on the fact that isolation was violated for the task.
@@ -175,7 +198,10 @@ static void task_isolation_deliver_signal(struct task_struct *task,
 	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
 		task->comm, task->pid, buf);
 
-	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	/*
+	 * Turn off task isolation mode to avoid further isolation callbacks.
+	 * It can choose to re-enable task isolation mode in the signal handler.
+	 */
 	task_isolation_set_flags(task, 0);
 
 	send_sig_info(info.si_signo, &info, task);
@@ -190,15 +216,20 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 	struct task_struct *task = current;
 	va_list args;
 	char buf[100];
+	int sig;
 
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
 
+	sig = task_isolation_sig(task->task_isolation_flags);
+	if (sig == 0)
+		return;
+
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 
-	task_isolation_deliver_signal(task, buf);
+	task_isolation_deliver_signal(task, buf, sig);
 }
 
 /*
@@ -209,14 +240,19 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 int task_isolation_syscall(int syscall)
 {
 	char buf[20];
+	int sig;
 
 	if (syscall == __NR_prctl ||
 	    syscall == __NR_exit ||
 	    syscall == __NR_exit_group)
 		return 0;
 
+	sig = task_isolation_sig(current->task_isolation_flags);
+	if (sig == 0)
+		return 0;
+
 	snprintf(buf, sizeof(buf), "syscall %d", syscall);
-	task_isolation_deliver_signal(current, buf);
+	task_isolation_deliver_signal(current, buf, sig);
 
 	syscall_set_return_value(current, current_pt_regs(),
 					 -ERESTARTNOINTR, -1);
@@ -236,6 +272,7 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 {
 	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
 	bool force_debug = false;
+	int sig;
 
 	/*
 	 * Our caller made sure the task was running on a task isolation
@@ -266,10 +303,13 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 	 * and instead just treat it as if "debug" mode was enabled,
 	 * since that's pretty much all we can do.
 	 */
-	if (in_nmi())
-		force_debug = true;
-	else
-		task_isolation_deliver_signal(p, type);
+	sig = task_isolation_sig(p->task_isolation_flags);
+	if (sig != 0) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_deliver_signal(p, type, sig);
+	}
 
 	/*
 	 * If (for example) the timer interrupt starts ticking
-- 
2.7.2

* [PATCH v14 14/14] task_isolation self test
  2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
                   ` (12 preceding siblings ...)
  2016-08-09 20:29 ` [PATCH v14 13/14] task_isolation: add user-settable notification signal Chris Metcalf
@ 2016-08-09 20:29 ` Chris Metcalf
  13 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-08-09 20:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Shuah Khan, Shuah Khan, linux-kselftest,
	linux-kernel
  Cc: Chris Metcalf

This code tests various aspects of task_isolation.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |  11 +
 tools/testing/selftests/task_isolation/config      |   2 +
 tools/testing/selftests/task_isolation/isolation.c | 618 +++++++++++++++++++++
 4 files changed, 632 insertions(+)
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff9e5f20a5a7..bd97479f44b3 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -23,6 +23,7 @@ TARGETS += sigaltstack
 TARGETS += size
 TARGETS += static_keys
 TARGETS += sysctl
+TARGETS += task_isolation
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/task_isolation/Makefile b/tools/testing/selftests/task_isolation/Makefile
new file mode 100644
index 000000000000..c8927994fe18
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/Makefile
@@ -0,0 +1,11 @@
+CFLAGS += -O2 -g -W -Wall
+LDFLAGS += -pthread
+
+TEST_PROGS := isolation
+
+all: $(TEST_PROGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TEST_PROGS)
diff --git a/tools/testing/selftests/task_isolation/config b/tools/testing/selftests/task_isolation/config
new file mode 100644
index 000000000000..49e18e43b737
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/config
@@ -0,0 +1,2 @@
+CONFIG_TASK_ISOLATION=y
+CONFIG_TASK_ISOLATION_ALL=y
diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
new file mode 100644
index 000000000000..30d188844c7e
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -0,0 +1,618 @@
+/*
+ * This test program tests the features of task isolation.
+ * 
+ * - Makes sure enabling task isolation fails if you are unaffinitized 
+ *   or on a non-task-isolation cpu.
+ * 
+ * - Tests that /sys/devices/system/cpu/task_isolation works correctly.
+ * 
+ * - Validates that various synchronous exceptions are fatal in isolation
+ *   mode:
+ * 
+ *   * Page fault
+ *   * System call
+ *   * TLB invalidation from another thread [1]
+ *   * Unaligned access [2]
+ * 
+ * - Tests that taking a user-defined signal for the above faults works.
+ * 
+ * - Tests that isolation in "no signal" mode works as expected: you can
+ *   perform multiple system calls without a signal, and if another
+ *   process bumps you, you return to userspace without any extra jitter.
+ * 
+ * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64
+ * [2] Unaligned access only causes exceptions on some platforms, e.g. tile
+ * 
+ * 
+ * You must be running under a kernel configured with TASK_ISOLATION.
+ * 
+ * You must either have configured with TASK_ISOLATION_ALL or else
+ * booted with an argument like "task_isolation=1-15" to enable some
+ * task-isolation cores.  If you get interrupts, you can also add
+ * the boot argument "task_isolation_debug" to learn more.
+ * 
+ * NOTE: you must disable the code in tick_nohz_stop_sched_tick()
+ * that limits the tick delta to the maximum scheduler deferment
+ * by making it conditional not just on "!ts->inidle" but also
+ * on !test_thread_flag(TIF_TASK_ISOLATION).  This is around line 1292
+ * in kernel/time/tick-sched.c (as of kernel 4.7).
+ * 
+ *
+ * To compile the test program, run "make".
+ * 
+ * Run the program as "./isolation" and if you want to run the
+ * jitter-detection loop for longer than 10 giga-cycles, specify the
+ * number of giga-cycles to run it for as a command-line argument.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <string.h>
+#include <errno.h>
+#include <sched.h>
+#include <pthread.h>
+#include <sys/wait.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/prctl.h>
+#include "../kselftest.h"
+
+#ifndef PR_SET_TASK_ISOLATION   /* Not in system headers yet? */
+# define PR_SET_TASK_ISOLATION		48
+# define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+    (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
+#endif
+
+/* The cpu we are using for isolation tests. */
+static int task_isolation_cpu;
+
+/* Overall status, maintained as tests run. */
+static int exit_status = KSFT_PASS;
+
+/* Set affinity to a single cpu or die if trying to do so fails. */
+void set_my_cpu(int cpu)
+{
+	cpu_set_t set;
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	int rc = sched_setaffinity(0, sizeof(cpu_set_t), &set);
+	assert(rc == 0);
+}
+
+/*
+ * Run a child process in task isolation mode and report its status.
+ * The child does mlockall() and moves itself to the task isolation cpu.
+ * It then runs SETUP_FUNC (if specified), calls
+ * prctl(PR_SET_TASK_ISOLATION, FLAGS) if FLAGS is non-zero, and then
+ * invokes TEST_FUNC and exits with its status.
+ */
+static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
+{
+	fflush(stdout);
+	int pid = fork();
+	assert(pid >= 0);
+	if (pid != 0) {
+		/* In parent; wait for child and return its status. */
+		int status;
+		waitpid(pid, &status, 0);
+		return status;
+	}
+
+	/* In child. */
+	int rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	set_my_cpu(task_isolation_cpu);
+	if (setup_func)
+		setup_func();
+	if (flags) {
+		int rc;
+		do
+			rc = prctl(PR_SET_TASK_ISOLATION, flags);
+		while (rc != 0 && errno == EAGAIN);
+		if (rc != 0) {
+			printf("couldn't enable isolation (%d): FAIL\n", errno);
+			ksft_exit_fail();
+		}
+	}
+	rc = test_func();
+	exit(rc);
+}
+
+/*
+ * Run a test and ensure it is killed with SIGKILL by default,
+ * for whatever misdemeanor is committed in TEST_FUNC.
+ * Also test it with SIGUSR1 to make sure that works.
+ */
+static void test_killed(const char *testname, void (*setup_func)(),
+			int (*test_func)())
+{
+	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+
+	status = run_test(setup_func, test_func,
+			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
+			  PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
+		printf("%s (SIGUSR1): OK\n", testname);
+	} else {
+		printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Run a test and make sure it exits with success. */
+static void test_ok(const char *testname, void (*setup_func)(),
+		    int (*test_func)())
+{
+	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (status == KSFT_PASS) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Run a test with no signals and make sure it exits with success. */
+static void test_nosig(const char *testname, void (*setup_func)(),
+		       int (*test_func)())
+{
+	int status =
+		run_test(setup_func, test_func,
+			 PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
+	if (status == KSFT_PASS) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Mapping address passed from setup function to test function. */
+static char *fault_file_mapping;
+
+/* mmap() a file in so we can test touching an unmapped page. */
+static void setup_fault(void)
+{
+	char fault_file[] = "/tmp/isolation_XXXXXX";
+	int fd = mkstemp(fault_file);
+	assert(fd >= 0);
+	int rc = ftruncate(fd, getpagesize());
+	assert(rc == 0);
+	fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+				  MAP_SHARED, fd, 0);
+	assert(fault_file_mapping != MAP_FAILED);
+	close(fd);
+	unlink(fault_file);
+}
+
+/* Now touch the unmapped page (and be killed). */
+static int do_fault(void)
+{
+	*fault_file_mapping = 1;
+	return KSFT_FAIL;
+}
+
+/* Make a syscall (and be killed). */
+static int do_syscall(void)
+{
+	write(STDOUT_FILENO, "goodbye, world\n", 13);
+	return KSFT_FAIL;
+}
+
+/* Turn isolation back off and don't be killed. */
+static int do_syscall_off(void)
+{
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	write(STDOUT_FILENO, "==> hello, world\n", 17);
+	return KSFT_PASS;
+}
+
+/* If we're not getting a signal, make sure we can do multiple system calls. */
+static int do_syscall_multi(void)
+{
+	write(STDOUT_FILENO, "==> hello, world 1\n", 19);
+	write(STDOUT_FILENO, "==> hello, world 2\n", 19);
+	return KSFT_PASS;
+}
+
+#ifdef __aarch64__
+/* ARM64 uses tlbi instructions so doesn't need to interrupt the remote core. */
+static void test_munmap(void) {}
+#else
+
+/*
+ * Fork a thread that will munmap() after a short while.
+ * It will deliver a TLB flush to the task isolation core.
+ */
+
+static void *start_munmap(void *p)
+{
+	usleep(500000);   /* 0.5s */
+	munmap(p, getpagesize());
+	return 0;
+}
+
+static void setup_munmap(void)
+{
+	/* First, go back to cpu 0 and allocate some memory. */
+	set_my_cpu(0);
+	void *p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		       MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
+	assert(p != MAP_FAILED);
+
+	/*
+	 * Now fire up a thread that will wait half a second on cpu 0
+	 * and then munmap the mapping.
+	 */
+	pthread_t thr;
+	int rc = pthread_create(&thr, NULL, start_munmap, p);
+	assert(rc == 0);
+
+	/* Back to the task-isolation cpu. */
+	set_my_cpu(task_isolation_cpu);
+}
+
+/* Global variable to avoid the compiler outsmarting us. */
+volatile int munmap_spin;
+
+static int do_munmap(void)
+{
+	while (munmap_spin < 1000000000)
+		++munmap_spin;
+	return KSFT_FAIL;
+}
+
+static void test_munmap(void)
+{
+	test_killed("test_munmap", setup_munmap, do_munmap);
+}
+#endif
+
+#ifdef __tilegx__
+/*
+ * Make an unaligned access (and be killed).
+ * Only for tilegx, since other platforms don't do in-kernel fixups.
+ */
+static int
+do_unaligned(void)
+{
+	static int buf[2];
+	volatile int *addr = (volatile int *)((char *)buf + 1);
+
+	*addr;
+
+	asm("nop");
+	return KSFT_FAIL;
+}
+
+static void test_unaligned(void)
+{
+	test_killed("test_unaligned", NULL, do_unaligned);
+}
+#else
+static void test_unaligned(void) {}
+#endif
+
+/*
+ * Fork a process that will spin annoyingly on the same core
+ * for half a second.  Since prctl() won't succeed while another
+ * task is runnable on the core, we follow this handshake sequence:
+ *
+ * 1. Child (in setup_quiesce, here) starts up on cpu 0, advances
+ *    *childstate to let the parent know how far it has gotten, and
+ *    busy-waits for *statep to become nonzero.
+ * 2. Parent (in do_quiesce, below) enables task isolation with
+ *    prctl(), then sets *statep to 1.  Now, as soon as the parent
+ *    is scheduled out, it won't schedule back in until the child
+ *    stops spinning.
+ * 3. Child sees *statep change to 1, migrates to the task-isolation
+ *    cpu, sets *statep to 2, and spins for half a second before
+ *    exiting.
+ * 4. Parent spins waiting for *statep to reach 2, then makes one
+ *    syscall; returning from that syscall should take about half a
+ *    second, since the spinning child keeps the parent off the cpu.
+ */
+
+volatile int *statep, *childstate;
+struct timeval quiesce_start, quiesce_end;
+int child_pid;
+
+static void setup_quiesce(void)
+{
+	/* First, go back to cpu 0 and allocate some shared memory. */
+	set_my_cpu(0);
+	statep = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		      MAP_ANONYMOUS|MAP_SHARED, 0, 0);
+	assert(statep != MAP_FAILED);
+	childstate = statep + 1;
+
+	gettimeofday(&quiesce_start, NULL);
+
+	/* Fork and fault in all memory in both. */
+	child_pid = fork();
+	assert(child_pid >= 0);
+	if (child_pid == 0)
+		*childstate = 1;
+	int rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	if (child_pid != 0) {
+		set_my_cpu(task_isolation_cpu);
+		return;
+	}
+
+	/*
+	 * In child.  Wait until parent notifies us that it has completed
+	 * its prctl, then jump to its cpu and let it know.
+	 */
+	*childstate = 2;
+	while (*statep == 0)
+		;
+	*childstate = 3;
+	set_my_cpu(task_isolation_cpu);
+	*statep = 2;
+	*childstate = 4;
+
+	/*
+	 * Now we are competing for the runqueue on task_isolation_cpu.
+	 * Spin for half a second to ensure the parent gets caught in kernel space.
+	 */
+	struct timeval start, tv;
+	gettimeofday(&start, NULL);
+	while (1) {
+		gettimeofday(&tv, NULL);
+		double time = (tv.tv_sec - start.tv_sec) +
+			(tv.tv_usec - start.tv_usec) / 1000000.0;
+		if (time >= 0.5)
+			exit(0);
+	}
+}
+
+static int do_quiesce(void)
+{
+	double time;
+	int rc;
+
+	rc = prctl(PR_SET_TASK_ISOLATION,
+		   PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
+	if (rc != 0) {
+		prctl(PR_SET_TASK_ISOLATION, 0);
+		printf("prctl failed: rc %d", rc);
+		goto fail;
+	}
+	*statep = 1;
+
+	/* Wait for child to come disturb us. */
+	while (*statep == 1) {
+		gettimeofday(&quiesce_end, NULL);
+		time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
+			(quiesce_end.tv_usec - quiesce_start.tv_usec)/1000000.0;
+		if (time > 0.1 && *statep == 1)	{
+			prctl(PR_SET_TASK_ISOLATION, 0);
+			printf("timed out at %gs in child migrate loop (%d)\n",
+			       time, *childstate);
+			char buf[100];
+			sprintf(buf, "cat /proc/%d/stack", child_pid);
+			system(buf);
+			goto fail;
+		}
+	}
+	assert(*statep == 2);
+
+	/*
+	 * At this point the child is spinning, so any interrupt will keep us
+	 * in kernel space.  Make a syscall to make sure it happens at least
+	 * once during the half second that the child is spinning.
+	 */
+	kill(0, 0);
+	gettimeofday(&quiesce_end, NULL);
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
+		(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
+	if (time < 0.4 || time > 0.6) {
+		printf("expected 1s wait after quiesce: was %g\n", time);
+		goto fail;
+	}
+	kill(child_pid, SIGKILL);
+	return KSFT_PASS;
+
+fail:
+	kill(child_pid, SIGKILL);
+	return KSFT_FAIL;
+}
+
+#ifdef __tile__
+#include <arch/spr_def.h>
+#endif
+
+static inline unsigned long get_cycle_count(void)
+{
+#ifdef __x86_64__
+	unsigned int lower, upper;
+	__asm__ __volatile__("rdtsc" : "=a"(lower), "=d"(upper));
+	return lower | ((unsigned long)upper << 32);
+#elif defined(__tile__)
+	return __insn_mfspr(SPR_CYCLE);
+#elif defined(__aarch64__)
+	unsigned long vtick;
+	__asm__ volatile("mrs %0, cntvct_el0" : "=r" (vtick));
+	return vtick;
+#else
+#error Unsupported architecture
+#endif
+}
+
+/* Histogram of cycle counts up to HISTSIZE cycles. */
+#define HISTSIZE 500
+long hist[HISTSIZE];
+
+/* Information on loss of control of the cpu (more than HISTSIZE cycles). */
+struct jitter_info {
+	unsigned long at;      /* cycle of jitter event */
+	long cycles;           /* how long we lost the cpu for */
+};
+#define MAX_EVENTS 100
+volatile struct jitter_info jitter[MAX_EVENTS];
+unsigned int count;            /* index into jitter[] */
+
+void jitter_summarize(void)
+{
+	printf("INFO: loop times:\n");
+	unsigned int i;
+	for (i = 0; i < HISTSIZE; ++i)
+		if (hist[i])
+			printf("  %u x %ld\n", i, hist[i]);
+
+	if (count)
+		printf("ERROR: jitter:\n");
+	for (i = 0; i < count; ++i)
+		printf("  %ld: %ld cycles\n", jitter[i].at, jitter[i].cycles);
+	if (count == sizeof(jitter)/sizeof(jitter[0]))
+		printf("  ... more\n");
+}
+
+void jitter_handler(int sig)
+{
+	printf("\n");
+	if (sig == SIGUSR1) {
+		exit_status = KSFT_FAIL;
+		printf("ERROR: Program unexpectedly entered kernel.\n");
+	}
+	jitter_summarize();
+	exit(exit_status);
+}
+
+void test_jitter(unsigned long waitticks)
+{
+	printf("testing task isolation jitter for %ld ticks\n", waitticks);
+
+	signal(SIGINT, jitter_handler);
+	signal(SIGUSR1, jitter_handler);
+	set_my_cpu(task_isolation_cpu);
+	int rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+
+	do
+		rc = prctl(PR_SET_TASK_ISOLATION,
+			   PR_TASK_ISOLATION_ENABLE |
+			   PR_TASK_ISOLATION_USERSIG |
+			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+	while (rc != 0 && errno == EAGAIN);
+	if (rc != 0) {
+		printf("couldn't enable isolation (%d): FAIL\n", errno);
+		ksft_exit_fail();
+	}
+
+	unsigned long start = get_cycle_count();
+	unsigned long last = start;
+	unsigned long elapsed;
+	do {
+		unsigned long next = get_cycle_count();
+		unsigned long delta = next - last;
+		elapsed = next - start;
+		if (__builtin_expect(delta >= HISTSIZE, 0)) {
+			exit_status = KSFT_FAIL;
+			if (count < MAX_EVENTS) {
+				jitter[count].cycles = delta;
+				jitter[count].at = elapsed;
+				++count;
+			}
+		} else {
+			hist[delta]++;
+		}
+		last = next;
+
+	} while (elapsed < waitticks);
+
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	jitter_summarize();
+}
+
+int main(int argc, char **argv)
+{
+	/* How many billion ticks to wait after running the other tests? */
+	unsigned long waitticks;
+	if (argc == 1)
+		waitticks = 10;
+	else if (argc == 2)
+		waitticks = strtoul(argv[1], NULL, 10);
+	else {
+		printf("syntax: isolation [gigaticks]\n");
+		ksft_exit_fail();
+	}
+	waitticks *= 1000000000;
+
+	/* Test that the /sys device is present and pick a cpu. */
+	FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");
+	if (f == NULL) {
+		printf("/sys device: SKIP (%s)\n", strerror(errno));
+		ksft_exit_skip();
+	}
+	char buf[100];
+	char *result = fgets(buf, sizeof(buf), f);
+	assert(result == buf);
+	fclose(f);
+	if (*buf == '\n') {
+		printf("No task_isolation cores configured.\n");
+		ksft_exit_skip();
+	}
+	char *end;
+	task_isolation_cpu = strtol(buf, &end, 10);
+	assert(end != buf);
+	assert(*end == ',' || *end == '-' || *end == '\n');
+	assert(task_isolation_cpu >= 0);
+	printf("/sys device : OK (using task isolation cpu %d)\n",
+	       task_isolation_cpu);
+
+	/* Test that enabling isolation fails when not affinitized to one cpu. */
+	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl unaffinitized: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl unaffinitized: OK\n");
+	}
+
+	/* Or if affinitized to the wrong cpu. */
+	set_my_cpu(0);
+	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl on cpu 0: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl on cpu 0: OK\n");
+	}
+
+	/* Run the tests. */
+	test_killed("test_fault", setup_fault, do_fault);
+	test_killed("test_syscall", NULL, do_syscall);
+	test_munmap();
+	test_unaligned();
+	test_ok("test_off", NULL, do_syscall_off);
+	test_nosig("test_multi", NULL, do_syscall_multi);
+	test_nosig("test_quiesce", setup_quiesce, do_quiesce);
+
+	/* Exit failure if any test failed. */
+	if (exit_status != KSFT_PASS) {
+		printf("Skipping jitter testing due to test failures\n");
+		return exit_status;
+	}
+
+	test_jitter(waitticks);
+
+	return exit_status;
+}
-- 
2.7.2


* Re: [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-09 20:29 ` [PATCH v14 06/14] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-08-10  7:52   ` Andy Lutomirski
  2016-08-10 14:30     ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2016-08-10  7:52 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Christoph Lameter, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Ingo Molnar, Steven Rostedt,
	Tejun Heo, Will Deacon, Rik van Riel, Frederic Weisbecker,
	Paul E. McKenney, linux-kernel, X86 ML, H. Peter Anvin,
	Catalin Marinas, Peter Zijlstra

On Aug 9, 2016 11:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> In exit_to_usermode_loop(), call task_isolation_ready() for
> TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
> and after we've handled the other work, call task_isolation_enter()
> for such tasks.
>
> In syscall_trace_enter_phase1(), we add the necessary support for
> reporting syscalls for task-isolation processes.
>
> We add strict reporting for the kernel exception types that do
> not result in signals, namely non-signalling page faults and
> non-signalling MPX fixups.
>
> Tested-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  arch/x86/Kconfig                   |  1 +
>  arch/x86/entry/common.c            | 20 +++++++++++++++++++-
>  arch/x86/include/asm/thread_info.h |  2 ++
>  arch/x86/kernel/smp.c              |  2 ++
>  arch/x86/kernel/traps.c            |  3 +++
>  arch/x86/mm/fault.c                |  5 +++++
>  6 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5c6e7471b732..10b2c0567dad 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -89,6 +89,7 @@ config X86
>         select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if MMU && COMPAT
>         select HAVE_ARCH_SECCOMP_FILTER
>         select HAVE_ARCH_SOFT_DIRTY             if X86_64
> +       select HAVE_ARCH_TASK_ISOLATION
>         select HAVE_ARCH_TRACEHOOK
>         select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>         select HAVE_EBPF_JIT                    if X86_64
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..5c1b9fc89bf2 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -21,6 +21,7 @@
>  #include <linux/context_tracking.h>
>  #include <linux/user-return-notifier.h>
>  #include <linux/uprobes.h>
> +#include <linux/isolation.h>
>
>  #include <asm/desc.h>
>  #include <asm/traps.h>
> @@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
>         if (emulated)
>                 return -1L;
>
> +       /* In isolation mode, we may prevent the syscall from running. */
> +       if (work & _TIF_TASK_ISOLATION) {
> +               if (task_isolation_syscall(regs->orig_ax) == -1) {
> +                       regs->orig_ax = -1;
> +                       return 0;
> +               }
> +               work &= ~_TIF_TASK_ISOLATION;
> +       }
> +

What is this?  It's not mentioned in the changelog.  It seems
nonsensical to me.  If nothing else, you forgot to update regs->ax,
but I don't even know what you're trying to do.

>  #ifdef CONFIG_SECCOMP
>         /*
>          * Do seccomp after ptrace, to catch any tracer changes.
> @@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
>
>  #define EXIT_TO_USERMODE_LOOP_FLAGS                            \
>         (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
> -        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
> +        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
>

Where are you updating the conditions to force use of the slow path?
(That's _TIF_ALLWORK_MASK.)

--Andy


* Re: [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-10  7:52   ` Andy Lutomirski
@ 2016-08-10 14:30     ` Chris Metcalf
  2016-08-10 19:17       ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-10 14:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Christoph Lameter, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Ingo Molnar, Steven Rostedt,
	Tejun Heo, Will Deacon, Rik van Riel, Frederic Weisbecker,
	Paul E. McKenney, linux-kernel, X86 ML, H. Peter Anvin,
	Catalin Marinas, Peter Zijlstra, Francis Giraldeau

On 8/10/2016 3:52 AM, Andy Lutomirski wrote:
> On Aug 9, 2016 11:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
> @@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
>          if (emulated)
>                  return -1L;
>
> +       /* In isolation mode, we may prevent the syscall from running. */
> +       if (work & _TIF_TASK_ISOLATION) {
> +               if (task_isolation_syscall(regs->orig_ax) == -1) {
> +                       regs->orig_ax = -1;
> +                       return 0;
> +               }
> +               work &= ~_TIF_TASK_ISOLATION;
> +       }
> +
> What is this?  It's not mentioned in the changelog.  It seems
> nonsensical to me.  If nothing else, you forgot to update regs->ax,
> but I don't even know what you're trying to do.

It's mentioned in the changelog as "Fixes a bug in x86 syscall_trace_enter()
[seen by Francis Giraldeau]."  To be fair, I didn't hear back from Francis, and
you're right, this doesn't look like it makes any sense now.  (I've added him
to the cc's on this email; for this series I had just put him on the cover letter.)

I modeled this code on a snippet from the old two-phase syscall entry work:

                if (ret == SECCOMP_PHASE1_SKIP) {
                        regs->orig_ax = -1;
                        ret = 0;
                }

You got rid of this during the 4.7-rc series, but my code above was at least
plausibly valid until then :-)

Regardless, I assume that the right thing for that condition to do now when
it triggers is to set regs->ax = -ENOSYS and return -1L?  I'll update the
git repository with that in any case.

Thanks!

>>   #ifdef CONFIG_SECCOMP
>>          /*
>>           * Do seccomp after ptrace, to catch any tracer changes.
>> @@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>
>>   #define EXIT_TO_USERMODE_LOOP_FLAGS                            \
>>          (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
>> -        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
>> +        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
>>
> Where are you updating the conditions to force use of the slow path?
> (That's _TIF_ALLWORK_MASK.)

Whenever _TIF_TASK_ISOLATION is set, _TIF_NOHZ is also set.
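
(Sketch of the coupling being described, for illustration only; this is
not the literal patch code:)

	/* flag the task for both the entry and exit-to-user slow paths */
	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
	set_tsk_thread_flag(task, TIF_NOHZ);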

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-10 14:30     ` Chris Metcalf
@ 2016-08-10 19:17       ` Andy Lutomirski
  2016-08-10 19:40         ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2016-08-10 19:17 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Christoph Lameter, X86 ML, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Steven Rostedt, Ingo Molnar,
	Tejun Heo, Rik van Riel, Will Deacon, Frederic Weisbecker,
	Paul E. McKenney, Francis Giraldeau, linux-kernel,
	Catalin Marinas, H. Peter Anvin, Peter Zijlstra

On Aug 10, 2016 5:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/10/2016 3:52 AM, Andy Lutomirski wrote:
>>
>> On Aug 9, 2016 11:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> @@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>          if (emulated)
>>                  return -1L;
>>
>> +       /* In isolation mode, we may prevent the syscall from running. */
>> +       if (work & _TIF_TASK_ISOLATION) {
>> +               if (task_isolation_syscall(regs->orig_ax) == -1) {
>> +                       regs->orig_ax = -1;
>> +                       return 0;
>> +               }
>> +               work &= ~_TIF_TASK_ISOLATION;
>> +       }
>> +
>> What is this?  It's not mentioned in the changelog.  It seems
>> nonsensical to me.  If nothing else, you forgot to update regs->ax,
>> but I don't even know what you're trying to do.
>
>
> It's mentioned in the changelog as "Fixes a bug in x86 syscall_trace_enter()
> [seen by Francis Giraldeau]."  To be fair, I didn't hear back from Francis, and
> you're right, this doesn't look like it makes any sense now.  (I've added him
> to the cc's on this email; for this series I had just put him on the cover letter.)
>
> I modeled this code on a snippet from the old two-phase syscall entry work:
>
>                if (ret == SECCOMP_PHASE1_SKIP) {
>                        regs->orig_ax = -1;
>                        ret = 0;
>                }
>
> You got rid of this during the 4.7-rc series, but my code above was at least
> plausibly valid until then :-)
>
> Regardless, I assume that the right thing for that condition to do now when
> it triggers is to set regs->ax = -ENOSYS and return -1L?  I'll update the
> git repository with that in any case.

regs->ax will already be -ENOSYS unless something changed it, but I'm
not sure what this code is trying to do.  Is the idea that
task_isolation_syscall might enqueue a signal and you want to deliver
it without processing the syscall?  If so, a comment would be nice.
You could even WARN_ON(!signal_pending()).

>
> Thanks!
>
>
>>>   #ifdef CONFIG_SECCOMP
>>>          /*
>>>           * Do seccomp after ptrace, to catch any tracer changes.
>>> @@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>>
>>>   #define EXIT_TO_USERMODE_LOOP_FLAGS                            \
>>>          (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
>>> -        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
>>> +        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
>>>
>> Where are you updating the conditions to force use of the slow path?
>> (That's _TIF_ALLWORK_MASK.)
>
>
> Whenever _TIF_TASK_ISOLATION is set, _TIF_NOHZ is also set.

OK, but why not decouple it a bit and add it to the mask?  I keep
meaning to add a BUILD_BUG_ON checking for bits in
EXIT_TO_USERMODE_LOOP_FLAGS that aren't in the appropriate slow path
masks.
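
Something like this, say, as a rough sketch (exact placement aside):

	/* every exit-to-usermode work flag must also force the slow path */
	BUILD_BUG_ON(EXIT_TO_USERMODE_LOOP_FLAGS & ~_TIF_ALLWORK_MASK);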

>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>


* Re: [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-10 19:17       ` Andy Lutomirski
@ 2016-08-10 19:40         ` Chris Metcalf
  2016-08-10 20:06           ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-10 19:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Christoph Lameter, X86 ML, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Steven Rostedt, Ingo Molnar,
	Tejun Heo, Rik van Riel, Will Deacon, Frederic Weisbecker,
	Paul E. McKenney, Francis Giraldeau, linux-kernel,
	Catalin Marinas, H. Peter Anvin, Peter Zijlstra

On 8/10/2016 3:17 PM, Andy Lutomirski wrote:
> On Aug 10, 2016 5:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 8/10/2016 3:52 AM, Andy Lutomirski wrote:
>>> On Aug 9, 2016 11:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>> @@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>>           if (emulated)
>>>                   return -1L;
>>>
>>> +       /* In isolation mode, we may prevent the syscall from running. */
>>> +       if (work & _TIF_TASK_ISOLATION) {
>>> +               if (task_isolation_syscall(regs->orig_ax) == -1) {
>>> +                       regs->orig_ax = -1;
>>> +                       return 0;
>>> +               }
>>> +               work &= ~_TIF_TASK_ISOLATION;
>>> +       }
>>> +
>>> What is this?  It's not mentioned in the changelog.  It seems
>>> nonsensical to me.  If nothing else, you forgot to update regs->ax,
>>> but I don't even know what you're trying to do.
>>
>> It's mentioned in the changelog as "Fixes a bug in x86 syscall_trace_enter()
>> [seen by Francis Giraldeau]."  To be fair, I didn't hear back from Francis, and
>> you're right, this doesn't look like it makes any sense now.  (I've added him
>> to the cc's on this email; for this series I had just put him on the cover letter.)
>>
>> I modeled this code on a snippet from the old two-phase syscall entry work:
>>
>>                 if (ret == SECCOMP_PHASE1_SKIP) {
>>                         regs->orig_ax = -1;
>>                         ret = 0;
>>                 }
>>
>> You got rid of this during the 4.7-rc series, but my code above was at least
>> plausibly valid until then :-)
>>
>> Regardless, I assume that the right thing for that condition to do now when
>> it triggers is to set regs->ax = -ENOSYS and return -1L?  I'll update the
>> git repository with that in any case.
> regs->ax will already be -ENOSYS unless something changed it

Right, I see that now in entry_SYSCALL_64_after_swapgs.  Good.

> but I'm
> not sure what this code is trying to do.  Is the idea that
> task_isolation_syscall might enqueue a signal and you want to deliver
> it without processing the syscall?  If so, a comment would be nice.
> You could even WARN_ON(!signal_pending()).

If you are in task isolation mode (and you haven't also requested NOSIG),
then attempting a system call fails, and you get a signal delivered.  You
convinced me that failing the syscall was the thing to do back here:

https://lkml.kernel.org/r/CALCETrUrc_LJyLJLHefSDYagCrNqqzKuknr6uLgVXnPW8PmZKw@mail.gmail.com

>>>>    #ifdef CONFIG_SECCOMP
>>>>           /*
>>>>            * Do seccomp after ptrace, to catch any tracer changes.
>>>> @@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>>>
>>>>    #define EXIT_TO_USERMODE_LOOP_FLAGS                            \
>>>>           (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
>>>> -        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
>>>> +        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
>>>>
>>> Where are you updating the conditions to force use of the slow path?
>>> (That's _TIF_ALLWORK_MASK.)
>>
>> Whenever _TIF_TASK_ISOLATION is set, _TIF_NOHZ is also set.
> OK, but why not decouple it a bit and add it to the mask?  I keep
> meaning to add a BUILD_BUG_ON checking for bits in
> EXIT_TO_USERMODE_LOOP_FLAGS that aren't in the appropriate slow path
> masks.

That does seem reasonable; I'll make the change.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 06/14] arch/x86: enable task isolation functionality
  2016-08-10 19:40         ` Chris Metcalf
@ 2016-08-10 20:06           ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2016-08-10 20:06 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Christoph Lameter, X86 ML, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Steven Rostedt, Ingo Molnar,
	Tejun Heo, Rik van Riel, Will Deacon, Frederic Weisbecker,
	Paul E. McKenney, Francis Giraldeau, linux-kernel,
	Catalin Marinas, H. Peter Anvin, Peter Zijlstra

On Wed, Aug 10, 2016 at 12:40 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 8/10/2016 3:17 PM, Andy Lutomirski wrote:
>>
>> On Aug 10, 2016 5:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/10/2016 3:52 AM, Andy Lutomirski wrote:
>>>>
>>>> On Aug 9, 2016 11:30 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>> @@ -91,6 +92,15 @@ static long syscall_trace_enter(struct pt_regs *regs)
>>>>           if (emulated)
>>>>                   return -1L;
>>>>
>>>> +       /* In isolation mode, we may prevent the syscall from running.
>>>> */
>>>> +       if (work & _TIF_TASK_ISOLATION) {
>>>> +               if (task_isolation_syscall(regs->orig_ax) == -1) {
>>>> +                       regs->orig_ax = -1;
>>>> +                       return 0;
>>>> +               }
>>>> +               work &= ~_TIF_TASK_ISOLATION;
>>>> +       }
>>>> +
>>>> What is this?  It's not mentioned in the changelog.  It seems
>>>> nonsensical to me.  If nothing else, you forgot to update regs->ax,
>>>> but I don't even know what you're trying to do.
>>>
>>>
>>> It's mentioned in the changelog as "Fixes a bug in x86
>>> syscall_trace_enter()
>>> [seen by Francis Giraldeau]."  To be fair, I didn't hear back from
>>> Francis, and
>>> you're right, this doesn't look like it makes any sense now.  (I've added
>>> him
>>> to the cc's on this email; for this series I had just put him on the
>>> cover letter.)
>>>
>>> I modeled this code on a snippet from the old two-phase syscall entry
>>> work:
>>>
>>>                 if (ret == SECCOMP_PHASE1_SKIP) {
>>>                         regs->orig_ax = -1;
>>>                         ret = 0;
>>>                 }
>>>
>>> You got rid of this during the 4.7-rc series, but my code above was at
>>> least
>>> plausibly valid until then :-)
>>>
>>> Regardless, I assume that the right thing for that condition to do now
>>> when
>>> it triggers is to set regs->ax = -ENOSYS and return -1L?  I'll update the
>>> git repository with that in any case.
>>
>> regs->ax will already be -ENOSYS unless something changed it
>
>
> Right, I see that now in entry_SYSCALL_64_after_swapgs.  Good.
>
>> but I'm
>> not sure what this code is trying to do.  Is the idea that
>> task_isolation_syscall might enqueue a signal and you want to deliver
>> it without processing the syscall?  If so, a comment would be nice.
>> You could even WARN_ON(!signal_pending()).
>
>
> If you are in task isolation mode (and you haven't also requested NOSIG),
> then attempting a system call fails, and you get a signal delivered.  You
> convinced me that failing the syscall was the thing to do back here:
>
> https://lkml.kernel.org/r/CALCETrUrc_LJyLJLHefSDYagCrNqqzKuknr6uLgVXnPW8PmZKw@mail.gmail.com

Indeed.  Then please add the comment and maybe the WARN_ON.
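
(Putting the pieces of this subthread together, the fixed hunk would
presumably end up looking something like this sketch, not the final
patch:)

	/*
	 * In task isolation mode we may suppress the syscall.  Unless
	 * the task also requested NOSIG, task_isolation_syscall() has
	 * queued a signal to deliver on the way out; regs->ax already
	 * holds -ENOSYS from the entry path, so just skip the syscall.
	 */
	if (work & _TIF_TASK_ISOLATION) {
		if (task_isolation_syscall(regs->orig_ax) == -1) {
			WARN_ON(!signal_pending(current));
			regs->orig_ax = -1;
			return -1L;
		}
		work &= ~_TIF_TASK_ISOLATION;
	}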

>
>>>>>    #ifdef CONFIG_SECCOMP
>>>>>           /*
>>>>>            * Do seccomp after ptrace, to catch any tracer changes.
>>>>> @@ -136,7 +146,7 @@ static long syscall_trace_enter(struct pt_regs
>>>>> *regs)
>>>>>
>>>>>    #define EXIT_TO_USERMODE_LOOP_FLAGS                            \
>>>>>           (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
>>>>> -        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
>>>>> +        _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY |
>>>>> _TIF_TASK_ISOLATION)
>>>>>
>>>> Where are you updating the conditions to force use of the slow path?
>>>> (That's _TIF_ALLWORK_MASK.)
>>>
>>>
>>> Whenever _TIF_TASK_ISOLATION is set, _TIF_NOHZ is also set.
>>
>> OK, but why not decouple it a bit and add it to the mask?  I keep
>> meaning to add a BUILD_BUG_ON checking for bits in
>> EXIT_TO_USERMODE_LOOP_FLAGS that aren't in the appropriate slow path
>> masks.
>
>
> That does seem reasonable; I'll make the change.
>
>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf
@ 2016-08-11 18:11   ` Frederic Weisbecker
  2016-08-11 18:50     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2016-08-11 18:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Aug 09, 2016 at 04:29:46PM -0400, Chris Metcalf wrote:
> +/*
> + * Each time we try to prepare for return to userspace in a process
> + * with task isolation enabled, we run this code to quiesce whatever
> + * subsystems we can readily quiesce to avoid later interrupts.
> + */
> +void task_isolation_enter(void)
> +{
> +	WARN_ON_ONCE(irqs_disabled());
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	/* Quieten the vmstat worker so it won't interrupt us. */
> +	quiet_vmstat_sync();

So, this is going to be called every time we resume to userspace
while in task isolation mode, right?

Do we need to quiesce vmstat every time before entering userspace?
I thought that vmstat only needs to be offlined once and for all?

And how about lru?

> +
> +	/*
> +	 * Request rescheduling unless we are in full dynticks mode.
> +	 * We would eventually get pre-empted without this, and if
> +	 * there's another task waiting, it would run; but by
> +	 * explicitly requesting the reschedule, we may reduce the
> +	 * latency.  We could directly call schedule() here as well,
> +	 * but since our caller is the standard place where schedule()
> +	 * is called, we defer to the caller.
> +	 *
> +	 * A more substantive approach here would be to use a struct
> +	 * completion here explicitly, and complete it when we shut
> +	 * down dynticks, but since we presumably have nothing better
> +	 * to do on this core anyway, just spinning seems plausible.
> +	 */
> +	if (!tick_nohz_tick_stopped())
> +		set_tsk_need_resched(current);

Again, that won't help :-)


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-11 18:11   ` Frederic Weisbecker
@ 2016-08-11 18:50     ` Christoph Lameter
  2016-08-15 14:59       ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2016-08-11 18:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:

> Do we need to quiesce vmstat every time before entering userspace?
> I thought that vmstat only needs to be offlined once and for all?

Once is sufficient after disabling the tick.


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-11 18:50     ` Christoph Lameter
@ 2016-08-15 14:59       ` Chris Metcalf
  2016-08-30  0:55         ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-15 14:59 UTC (permalink / raw)
  To: Christoph Lameter, Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Michal Hocko, linux-mm, linux-doc, linux-api,
	linux-kernel

On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
>
>> Do we need to quiesce vmstat every time before entering userspace?
>> I thought that vmstat only needs to be offlined once and for all?
> Once is sufficient after disabling the tick.

It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

For vmstat, you're right that it's somewhat heavyweight to do the
quiesce, and if we don't need it, it's wasted time on the return path.
So I will add a guard call to the new vmstat_idle() before invoking
quiet_vmstat_sync().  This slows down the path where it turns out we
do need to quieten vmstat, but not by too much.
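
That is, roughly:

	/* only pay for the full sync when there is something to flush */
	if (!vmstat_idle())
		quiet_vmstat_sync();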

The LRU quiesce is quite light-weight.  We just check pagevec_count()
on a handful of pagevec's, confirm they are all zero, and return
without further work.  So for that one, adding a separate
lru_add_drain_needed() guard test would just be wasted effort.

The thing to remember is that this is only relevant if the user has
explicitly requested the NOSIG behavior from task isolation, which we
don't really expect to be the default - we are implicitly encouraging
use of the default semantics of "you can't enter the kernel again
until you turn off isolation".
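
(For reference, the two modes in terms of the prctl() API exercised by
the selftest, purely as illustration:)

	/* default: any later kernel entry delivers a signal to the task */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);

	/* NOSIG: kernel entries are allowed, so the exit path has to
	 * re-quiesce on every return to userspace, as discussed above */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);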

> > +	if (!tick_nohz_tick_stopped())
> > +		set_tsk_need_resched(current);
> Again, that won't help

It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyn tick is stopped before return to userspace.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-15 14:59       ` Chris Metcalf
@ 2016-08-30  0:55         ` Frederic Weisbecker
  2016-08-30 15:41           ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2016-08-30  0:55 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
> On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> >On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
> >
> >>Do we need to quiesce vmstat every time before entering userspace?
> >>I thought that vmstat only needs to be offlined once and for all?
> >Once is sufficient after disabling the tick.
> 
> It's true that task_isolation_enter() is called every time before
> returning to user space while task isolation is enabled.
> 
> But once we enter the kernel again after returning from the initial
> prctl() -- assuming we are in NOSIG mode so doing so is legal in the
> first place -- almost anything can happen, certainly including
> restarting the tick.  Thus, we have to make sure that normal quiescing
> happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.

> 
> For vmstat, you're right that it's somewhat heavyweight to do the
> quiesce, and if we don't need it, it's wasted time on the return path.
> So I will add a guard call to the new vmstat_idle() before invoking
> quiet_vmstat_sync().  This slows down the path where it turns out we
> do need to quieten vmstat, but not by too much.

Why not do this on prctl() only?

> The LRU quiesce is quite light-weight.  We just check pagevec_count()
> on a handful of pagevec's, confirm they are all zero, and return
> without further work.  So for that one, adding a separate
> lru_add_drain_needed() guard test would just be wasted effort.

OK, if this one is justified, i.e. if the LRU may need draining every time
we re-enter the kernel, then we can keep it (I can't tell, I don't know
much about -mm).

> The thing to remember is that this is only relevant if the user has
> explicitly requested the NOSIG behavior from task isolation, which we
> don't really expect to be the default - we are implicitly encouraging
> use of the default semantics of "you can't enter the kernel again
> until you turn off isolation".

That's right. Although NOSIG is the only thing we can afford as long as
we drag around the 1Hz.

> 
> >> +	if (!tick_nohz_tick_stopped())
> >> +		set_tsk_need_resched(current);
> > Again, that won't help
> 
> It won't be better than spinning in a loop if there aren't any other
> schedulable processes, but it won't be worse either.  If there is
> another schedulable process, we at least will schedule it sooner than
> if we just sat in a busy loop and waited for the scheduler to kick
> us. But there's nothing else we can do anyway if we want to maintain
> the guarantee that the dyn tick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-30  0:55         ` Frederic Weisbecker
@ 2016-08-30 15:41           ` Chris Metcalf
  2016-08-30 17:10             ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-30 15:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:
> On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
>> On 8/11/2016 2:50 PM, Christoph Lameter wrote:
>>> On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
>>>
>>>> Do we need to quiesce vmstat every time before entering userspace?
>>>> I thought that vmstat only needs to be offlined once and for all?
>>> Once is sufficient after disabling the tick.
>> It's true that task_isolation_enter() is called every time before
>> returning to user space while task isolation is enabled.
>>
>> But once we enter the kernel again after returning from the initial
>> prctl() -- assuming we are in NOSIG mode so doing so is legal in the
>> first place -- almost anything can happen, certainly including
>> restarting the tick.  Thus, we have to make sure that normal quiescing
>> happens again before we return to userspace.
> Yes but we need to sort out what needs to be called only once on prctl().
>
> Once vmstat is quiesced, it's not going to need quiescing again even if we
> restart the tick.

That's true, but I really do like the idea of having a clean structure
where we verify all our prerequisites in task_isolation_ready(), and
have code to try to get things fixed up in task_isolation_enter().
If we start moving some things here and some things there, it gets
harder to manage.  I think by testing "!vmstat_idle()" in
task_isolation_enter() we are avoiding any substantial overhead.

I think it would be clearer to rename task_isolation_enter()
to task_isolation_prepare(); it might be less confusing.

Remember too that in general, we really don't need to think about
return-to-userspace as a hot path for task isolation, unlike how we
think about it all the rest of the time.  So it makes sense to
prioritize keeping things clean from a software development
perspective first, and high-performance only secondarily.

>> The thing to remember is that this is only relevant if the user has
>> explicitly requested the NOSIG behavior from task isolation, which we
>> don't really expect to be the default - we are implicitly encouraging
>> use of the default semantics of "you can't enter the kernel again
>> until you turn off isolation".
> That's right. Although NOSIG is the only thing we can afford as long as
> we drag around the 1Hz.

True enough.  Hopefully we'll finish sorting that out soon enough.

>>>> +	if (!tick_nohz_tick_stopped())
>>>> +		set_tsk_need_resched(current);
>>> Again, that won't help
>> It won't be better than spinning in a loop if there aren't any other
>> schedulable processes, but it won't be worse either.  If there is
>> another schedulable process, we at least will schedule it sooner than
>> if we just sat in a busy loop and waited for the scheduler to kick
>> us. But there's nothing else we can do anyway if we want to maintain
>> the guarantee that the dyn tick is stopped before return to userspace.
> I don't think it helps either way. If reschedule is pending, the current
> task already has TIF_RESCHED set.

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

     set_current_state(TASK_INTERRUPTIBLE);
     schedule();
     __set_current_state(TASK_RUNNING);

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-30 15:41           ` Chris Metcalf
@ 2016-08-30 17:10             ` Frederic Weisbecker
  2016-08-30 17:36               ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2016-08-30 17:10 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:
> On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:
> >On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
> >>On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> >>>On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
> >>>
> >>>>Do we need to quiesce vmstat every time before entering userspace?
> >>>>I thought that vmstat only needs to be offlined once and for all?
> >>>Once is sufficient after disabling the tick.
> >>It's true that task_isolation_enter() is called every time before
> >>returning to user space while task isolation is enabled.
> >>
> >>But once we enter the kernel again after returning from the initial
> >>prctl() -- assuming we are in NOSIG mode so doing so is legal in the
> >>first place -- almost anything can happen, certainly including
> >>restarting the tick.  Thus, we have to make sure that normal quiescing
> >>happens again before we return to userspace.
> >Yes but we need to sort out what needs to be called only once on prctl().
> >
> >Once vmstat is quiesced, it's not going to need quiescing again even if we
> >restart the tick.
> 
> That's true, but I really do like the idea of having a clean structure
> where we verify all our prerequisites in task_isolation_ready(), and
> have code to try to get things fixed up in task_isolation_enter().
> If we start moving some things here and some things there, it gets
> harder to manage.  I think by testing "!vmstat_idle()" in
> task_isolation_enter() we are avoiding any substantial overhead.

I think that making the code clear about what needs to be done once and
for all on prctl() and what needs to be done on every actual syscall
return is more important for readability.

> 
> I think it would be clearer to rename task_isolation_enter()
> to task_isolation_prepare(); it might be less confusing.

For the prctl part, why not.

> 
> Remember too that in general, we really don't need to think about
> return-to-userspace as a hot path for task isolation, unlike how we
> think about it all the rest of the time.  So it makes sense to
> prioritize keeping things clean from a software development
> perspective first, and high-performance only secondarily.
> 
> >>The thing to remember is that this is only relevant if the user has
> >>explicitly requested the NOSIG behavior from task isolation, which we
> >>don't really expect to be the default - we are implicitly encouraging
> >>use of the default semantics of "you can't enter the kernel again
> >>until you turn off isolation".
> >That's right. Although NOSIG is the only thing we can afford as long as
> >we drag around the 1Hz.
> 
> True enough.  Hopefully we'll finish sorting that out soon enough.
> 
> >>>>+	if (!tick_nohz_tick_stopped())
> >>>>+		set_tsk_need_resched(current);
> >>> Again, that won't help
> >>It won't be better than spinning in a loop if there aren't any other
> >>schedulable processes, but it won't be worse either.  If there is
> >>another schedulable process, we at least will schedule it sooner than
> >>if we just sat in a busy loop and waited for the scheduler to kick
> >>us. But there's nothing else we can do anyway if we want to maintain
> >>the guarantee that the dyn tick is stopped before return to userspace.
> >I don't think it helps either way. If reschedule is pending, the current
> >task already has TIF_RESCHED set.
> 
> See the other thread with Peter Z for the longer discussion of this.
> At this point I'm leaning towards replacing the set_tsk_need_resched() with
> 
>     set_current_state(TASK_INTERRUPTIBLE);
>     schedule();
>     __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-30 17:10             ` Frederic Weisbecker
@ 2016-08-30 17:36               ` Chris Metcalf
  2016-08-30 18:17                 ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-30 17:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On 8/30/2016 1:10 PM, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:
>> On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:
>>> On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
>>>> On 8/11/2016 2:50 PM, Christoph Lameter wrote:
>>>>> On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
>>>>>
>>>>>> Do we need to quiesce vmstat every time before entering userspace?
>>>>>> I thought that vmstat only needs to be offlined once and for all?
>>>>> Once is sufficient after disabling the tick.
>>>> It's true that task_isolation_enter() is called every time before
>>>> returning to user space while task isolation is enabled.
>>>>
>>>> But once we enter the kernel again after returning from the initial
>>>> prctl() -- assuming we are in NOSIG mode so doing so is legal in the
>>>> first place -- almost anything can happen, certainly including
>>>> restarting the tick.  Thus, we have to make sure that normal quiescing
>>>> happens again before we return to userspace.
>>> Yes but we need to sort out what needs to be called only once on prctl().
>>>
>>> Once vmstat is quiesced, it's not going to need quiescing again even if we
>>> restart the tick.
>> That's true, but I really do like the idea of having a clean structure
>> where we verify all our prerequisites in task_isolation_ready(), and
>> have code to try to get things fixed up in task_isolation_enter().
>> If we start moving some things here and some things there, it gets
>> harder to manage.  I think by testing "!vmstat_idle()" in
>> task_isolation_enter() we are avoiding any substantial overhead.
> I think that making the code clearer on what needs to be done once for
> all on prctl() and what needs to be done on all actual syscall return
> is more important for readability.

We don't need to just do it on prctl(), though.  We also need to do
it whenever we have been in the kernel for another reason, which
can happen with NOSIG.  So we need to do this on the common return
to userspace path no matter what, right?  Or am I missing something?
It seems like if, for example, we do mmaps or page faults, then on return
to userspace, some of those counters will have been incremented and
we'll need to run the quiet_vmstat_sync() code.

>>>>>> +	if (!tick_nohz_tick_stopped())
>>>>>> +		set_tsk_need_resched(current);
>>>>> Again, that won't help
>>>> It won't be better than spinning in a loop if there aren't any other
>>>> schedulable processes, but it won't be worse either.  If there is
>>>> another schedulable process, we at least will schedule it sooner than
>>>> if we just sat in a busy loop and waited for the scheduler to kick
>>>> us. But there's nothing else we can do anyway if we want to maintain
>>>> the guarantee that the dyn tick is stopped before return to userspace.
>>> I don't think it helps either way. If reschedule is pending, the current
>>> task already has TIF_RESCHED set.
>> See the other thread with Peter Z for the longer discussion of this.
>> At this point I'm leaning towards replacing the set_tsk_need_resched() with
>>
>>      set_current_state(TASK_INTERRUPTIBLE);
>>      schedule();
>>      __set_current_state(TASK_RUNNING);
> I don't see how that helps. What will wake the thread up except a signal?

The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace.
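
(In other words, the exit path effectively behaves like this sketch;
simplified, since the real loop also handles the usual TIF work flags:)

	while (!task_isolation_ready()) {
		task_isolation_enter();
		if (need_resched())
			schedule();
	}
	/* now safe to return to userspace with the tick stopped */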

It's very counter-intuitive to burn cpu time intentionally in the kernel.
I really don't see another way to resolve this, though.  I don't think
it would be safe, for example, to just promote the next dyntick to
running immediately (rather than waiting a few microseconds until
it is scheduled to go off).

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-30 17:36               ` Chris Metcalf
@ 2016-08-30 18:17                 ` Chris Metcalf
  2016-09-03 15:31                   ` Frederic Weisbecker
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Metcalf @ 2016-08-30 18:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On 8/30/2016 1:36 PM, Chris Metcalf wrote:
>>> See the other thread with Peter Z for the longer discussion of this.
>>> At this point I'm leaning towards replacing the set_tsk_need_resched() with
>>>
>>>      set_current_state(TASK_INTERRUPTIBLE);
>>>      schedule();
>>>      __set_current_state(TASK_RUNNING);
>> I don't see how that helps. What will wake the thread up except a signal?
>
> The answer is that the scheduler will keep bringing us back to this
> point (either after running another runnable task if there is one,
> or else just returning to this point immediately without doing a
> context switch), and we will then go around the "prepare exit to
> userspace" loop and perhaps discover that enough time has gone
> by that the last dyntick interrupt has triggered and the kernel has
> quiesced the dynticks.  At that point we stop calling schedule()
> over and over and can return normally to userspace. 

Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
schedule(), I never get control back.  So I don't want to do that.

I suppose I could do a schedule_timeout() here instead and try
to figure out how long to wait for the next dyntick.  But since we
don't expect anything else running on the core anyway, it seems
like we are probably working too hard at this point.  I don't think
it's worth it just to go into the idle task and (possibly) save some
power for a few microseconds.

The more I think about this, the more I think I am micro-optimizing
by trying to poke the scheduler prior to some external thing setting
need_resched, so I think the thing to do here is, in fact, nothing.
I won't worry about rescheduling but will just continue going around
the prepare-exit-to-userspace loop until the last dyn tick fires.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-08-30 18:17                 ` Chris Metcalf
@ 2016-09-03 15:31                   ` Frederic Weisbecker
  2016-09-09 18:54                     ` Chris Metcalf
  0 siblings, 1 reply; 30+ messages in thread
From: Frederic Weisbecker @ 2016-09-03 15:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote:
> On 8/30/2016 1:36 PM, Chris Metcalf wrote:
> >>>See the other thread with Peter Z for the longer discussion of this.
> >>>At this point I'm leaning towards replacing the set_tsk_need_resched() with
> >>>
> >>>     set_current_state(TASK_INTERRUPTIBLE);
> >>>     schedule();
> >>>     __set_current_state(TASK_RUNNING);
> >>I don't see how that helps. What will wake the thread up except a signal?
> >
> >The answer is that the scheduler will keep bringing us back to this
> >point (either after running another runnable task if there is one,
> >or else just returning to this point immediately without doing a
> >context switch), and we will then go around the "prepare exit to
> >userspace" loop and perhaps discover that enough time has gone
> >by that the last dyntick interrupt has triggered and the kernel has
> >quiesced the dynticks.  At that point we stop calling schedule()
> >over and over and can return normally to userspace.
> 
> Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
> schedule(), I never get control back.  So I don't want to do that.
> 
> I suppose I could do a schedule_timeout() here instead and try
> to figure out how long to wait for the next dyntick.  But since we
> don't expect anything else running on the core anyway, it seems
> like we are probably working too hard at this point.  I don't think
> it's worth it just to go into the idle task and (possibly) save some
> power for a few microseconds.
> 
> The more I think about this, the more I think I am micro-optimizing
> by trying to poke the scheduler prior to some external thing setting
> need_resched, so I think the thing to do here is, in fact, nothing.

Exactly, I fear there is nothing you can do about that.

> I won't worry about rescheduling but will just continue going around
> the prepare-exit-to-userspace loop until the last dyn tick fires.

You mean waiting in prepare-exit-to-userspace until the last tick fires?
I'm not sure that's a good idea either; this could take ages, or it could
just as well never happen.

I'd rather say that if we are in signal mode, fire the signal; otherwise
just return to userspace.  If there is a tick, it means that the
environment is not suitable for isolation anyway.


* Re: [PATCH v14 04/14] task_isolation: add initial support
  2016-09-03 15:31                   ` Frederic Weisbecker
@ 2016-09-09 18:54                     ` Chris Metcalf
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Metcalf @ 2016-09-09 18:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Michal Hocko, linux-mm, linux-doc,
	linux-api, linux-kernel

On 9/3/2016 11:31 AM, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote:
>> On 8/30/2016 1:36 PM, Chris Metcalf wrote:
>>>>> See the other thread with Peter Z for the longer discussion of this.
>>>>> At this point I'm leaning towards replacing the set_tsk_need_resched() with
>>>>>
>>>>>      set_current_state(TASK_INTERRUPTIBLE);
>>>>>      schedule();
>>>>>      __set_current_state(TASK_RUNNING);
>>>> I don't see how that helps. What will wake the thread up except a signal?
>>> The answer is that the scheduler will keep bringing us back to this
>>> point (either after running another runnable task if there is one,
>>> or else just returning to this point immediately without doing a
>>> context switch), and we will then go around the "prepare exit to
>>> userspace" loop and perhaps discover that enough time has gone
>>> by that the last dyntick interrupt has triggered and the kernel has
>>> quiesced the dynticks.  At that point we stop calling schedule()
>>> over and over and can return normally to userspace.
>> Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
>> schedule(), I never get control back.  So I don't want to do that.
>>
>> I suppose I could do a schedule_timeout() here instead and try
>> to figure out how long to wait for the next dyntick.  But since we
>> don't expect anything else running on the core anyway, it seems
>> like we are probably working too hard at this point.  I don't think
>> it's worth it just to go into the idle task and (possibly) save some
>> power for a few microseconds.
>>
>> The more I think about this, the more I think I am micro-optimizing
>> by trying to poke the scheduler prior to some external thing setting
>> need_resched, so I think the thing to do here is in fact, nothing.
> Exactly, I fear there is nothing you can do about that.
>
>> I won't worry about rescheduling but will just continue going around
>> the prepare-exit-to-userspace loop until the last dyn tick fires.
> You mean waiting in prepare-exit-to-userspace until the last tick fires?
> I'm not sure it's a good idea either, this could take ages, it could as
> well never happen.

If you don't mind, let's take this to the other thread discussing what to do
at return-to-userspace time:

https://lkml.kernel.org/r/440e20d1-441a-3228-6b37-6e71e9fce47c@mellanox.com

In general, I think if your task ends up waiting forever for the dyntick to
stop, with the approach suggested in that thread you will at least be
able to tell more easily, since the core will be running the idle task and
your task will be waiting on a dyntick-specific completion.
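
(A purely hypothetical shape for that, with an invented completion name:)

	/* completed by the tick code once the last dyntick has fired */
	wait_for_completion(&dyntick_stopped_done);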

> I'd rather say that if we are in signal mode, fire the signal; otherwise
> just return to userspace.  If there is a tick, it means that the
> environment is not suitable for isolation anyway.

True if there is an ongoing tick, but not if the tick is about to stop and we just need
to wait for the last tick to fire.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


end of thread, other threads:[~2016-09-10  1:29 UTC | newest]

Thread overview: 30+ messages
2016-08-09 20:29 [PATCH v14 00/14] support "task_isolation" mode Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 01/14] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 02/14] vmstat: add vmstat_idle function Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 04/14] task_isolation: add initial support Chris Metcalf
2016-08-11 18:11   ` Frederic Weisbecker
2016-08-11 18:50     ` Christoph Lameter
2016-08-15 14:59       ` Chris Metcalf
2016-08-30  0:55         ` Frederic Weisbecker
2016-08-30 15:41           ` Chris Metcalf
2016-08-30 17:10             ` Frederic Weisbecker
2016-08-30 17:36               ` Chris Metcalf
2016-08-30 18:17                 ` Chris Metcalf
2016-09-03 15:31                   ` Frederic Weisbecker
2016-09-09 18:54                     ` Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 05/14] task_isolation: track asynchronous interrupts Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 06/14] arch/x86: enable task isolation functionality Chris Metcalf
2016-08-10  7:52   ` Andy Lutomirski
2016-08-10 14:30     ` Chris Metcalf
2016-08-10 19:17       ` Andy Lutomirski
2016-08-10 19:40         ` Chris Metcalf
2016-08-10 20:06           ` Andy Lutomirski
2016-08-09 20:29 ` [PATCH v14 07/14] arm64: factor work_pending state machine to C Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 08/14] arch/arm64: enable task isolation functionality Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 09/14] arch/tile: " Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 10/14] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 11/14] clocksource: Do not schedule watchdog on isolated or NOHZ cpus Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 12/14] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 13/14] task_isolation: add user-settable notification signal Chris Metcalf
2016-08-09 20:29 ` [PATCH v14 14/14] task_isolation self test Chris Metcalf
