linux-kernel.vger.kernel.org archive mirror
* [PATCH v12 00/13] support "task_isolation" mode
@ 2016-04-05 17:38 Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
                   ` (13 more replies)
  0 siblings, 14 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Here is a respin of the task-isolation patch set.  The previous one
came out just before the merge window for 4.6 opened, so I suspect
folks may have been busy merging, since it got few comments.

Frederic, how are you feeling about taking this all via your tree?
And what is your take on the new PR_TASK_ISOLATION_ONE_SHOT mode?
I'm not sure what the right path to upstream for this series is.

Changes since v11:

- Rebased on v4.6-rc1.  This required me to create a
  can_stop_my_full_tick() helper in tick-sched.c, since the underlying
  can_stop_full_tick() now takes a struct tick_sched.

- Added a HAVE_ARCH_TASK_ISOLATION Kconfig flag so that you can't
  try to build with TASK_ISOLATION enabled for an architecture until
  it is explicitly configured to work.  This avoids possible
  allyesconfig build failures for unsupported architectures, or even
  for supported ones when bisecting to the middle of this series.

- Return EAGAIN instead of EINVAL for the enabling prctl() if the task
  is affinitized to a task-isolation core, but things just aren't yet
  right for it (e.g. another task running).  This lets the caller
  differentiate a potentially transient failure from a permanent
  failure, for which we still return EINVAL.
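For callers, this distinction supports a simple retry loop. A hedged sketch (the prctl constants are taken from this series' uapi header and are not in mainline headers; the retry count and delay are arbitrary):

```c
#include <errno.h>
#include <sys/prctl.h>
#include <unistd.h>

/* From this series' <uapi/linux/prctl.h>; not present in mainline. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

/* EAGAIN is potentially transient (e.g. another task is still runnable
 * on the core), so it is worth retrying; EINVAL is permanent (bad
 * affinity or not a task-isolation core), so give up immediately. */
static int should_retry_isolation(int err)
{
	return err == EAGAIN;
}

static int enable_isolation(int max_tries)
{
	while (max_tries-- > 0) {
		if (prctl(PR_SET_TASK_ISOLATION,
			  PR_TASK_ISOLATION_ENABLE, 0, 0, 0) == 0)
			return 0;
		if (!should_retry_isolation(errno))
			return -1;
		usleep(10 * 1000);  /* let the transient condition clear */
	}
	return -1;
}
```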

The previous (v11) patch series is here:

https://lkml.kernel.org/r/1457734223-26209-1-git-send-email-cmetcalf@mellanox.com

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
  arm, tile: turn off timer tick for oneshot_stopped state
  arch/x86: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality

 Documentation/kernel-parameters.txt    |  16 ++
 arch/arm64/Kconfig                     |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S              |  12 +-
 arch/arm64/kernel/ptrace.c             |  15 +-
 arch/arm64/kernel/signal.c             |  42 ++++-
 arch/arm64/kernel/smp.c                |   2 +
 arch/arm64/mm/fault.c                  |   4 +
 arch/tile/Kconfig                      |   1 +
 arch/tile/include/asm/thread_info.h    |   4 +-
 arch/tile/kernel/process.c             |   9 +
 arch/tile/kernel/ptrace.c              |   7 +
 arch/tile/kernel/single_step.c         |   5 +
 arch/tile/kernel/smp.c                 |  28 +--
 arch/tile/kernel/time.c                |   1 +
 arch/tile/kernel/unaligned.c           |   3 +
 arch/tile/mm/fault.c                   |   3 +
 arch/tile/mm/homecache.c               |   2 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/entry/common.c                |  18 +-
 arch/x86/include/asm/thread_info.h     |   2 +
 arch/x86/kernel/traps.c                |   2 +
 arch/x86/mm/fault.c                    |   2 +
 drivers/base/cpu.c                     |  18 ++
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h              |  63 +++++++
 include/linux/sched.h                  |   3 +
 include/linux/swap.h                   |   1 +
 include/linux/tick.h                   |   2 +
 include/linux/vmstat.h                 |   4 +
 include/uapi/linux/prctl.h             |   9 +
 init/Kconfig                           |  33 ++++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   3 +
 kernel/irq_work.c                      |   5 +-
 kernel/isolation.c                     | 316 +++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    |  18 ++
 kernel/signal.c                        |   8 +
 kernel/smp.c                           |   6 +-
 kernel/softirq.c                       |  33 ++++
 kernel/sys.c                           |   9 +
 kernel/time/tick-sched.c               |  36 ++--
 mm/swap.c                              |  15 +-
 mm/vmstat.c                            |  21 +++
 45 files changed, 743 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.7.2


* [PATCH v12 01/13] vmstat: add quiet_vmstat_sync function
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 02/13] vmstat: add vmstat_idle function Chris Metcalf
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-kernel
  Cc: Chris Metcalf

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 73fae8c4a5fb..43b2f1c33266 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,6 +190,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -251,6 +252,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5e4300482897..7a1cfe383349 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1458,6 +1458,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
 
 /*
  * Shepherd worker thread that checks the
-- 
2.7.2


* [PATCH v12 02/13] vmstat: add vmstat_idle function
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running and
that the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 43b2f1c33266..504ebd1fdf33 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -191,6 +191,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -253,6 +254,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7a1cfe383349..fa34ea480ac0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1469,6 +1469,18 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	int cpu = smp_processor_id();
+	return cpumask_test_cpu(cpu, cpu_stat_off) &&
+		!delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(cpu);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v12 03/13] lru_add_drain_all: factor out lru_add_drain_needed
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 02/13] vmstat: add vmstat_idle function Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 04/13] task_isolation: add initial support Chris Metcalf
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was being done in the loop in lru_add_drain_all(),
but having it be callable for a particular cpu is helpful for the
task-isolation patches.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..da21f5240702 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -304,6 +304,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..bdcdfa21094c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -653,6 +653,15 @@ void deactivate_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -679,11 +688,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.7.2


* [PATCH v12 04/13] task_isolation: add initial support
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (2 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-18 13:34   ` Peter Zijlstra
  2016-04-05 17:38 ` [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.
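The enabling prctl() requires the task to already be affinitized to a single task-isolation core; that precondition can be satisfied from userspace first. A minimal sketch (the CPU number would come from the task_isolation cpumask, e.g. via the sysfs file; the helper name is illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to exactly one CPU, as required before
 * prctl(PR_SET_TASK_ISOLATION): the enabling path checks that the
 * task's cpumask weight is 1 and that the CPU is in the
 * task_isolation map.  Returns 0 on success, -1 on failure. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```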

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.
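The pseudo-file prints the isolated CPUs in the kernel's cpulist format (e.g. "1-3,5").  As a hedged illustration, a helper for picking the first such CPU to pin to (this parses only the leading entry of the list):

```c
#include <stdio.h>

/* Extract the first CPU number from a cpulist string such as "1-3,5",
 * the format printed by /sys/devices/system/cpu/task_isolation.
 * Returns -1 if the string contains no CPU number. */
static int first_cpu_in_list(const char *list)
{
	int cpu;

	if (sscanf(list, "%d", &cpu) == 1 && cpu >= 0)
		return cpu;
	return -1;
}
```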

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++++
 include/linux/isolation.h           |  48 +++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 ++
 init/Kconfig                        |  23 ++++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 153 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   4 +
 kernel/sys.c                        |   9 +++
 kernel/time/tick-sched.c            |  36 ++++++---
 13 files changed, 300 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ecc74fa4bfde..9bd5e91357b1 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3808,6 +3808,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, set
+			the specified list of CPUs on which tasks will be able
+			to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..99b909462e64
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,48 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 60bba7e032dc..90f6856493bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1852,6 +1852,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index e0d26162432e..767f37bc3391 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,29 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for tasks running hard real-time code in userspace,
+	 such as a 10 Gbit network driver.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a good-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index f0c40bf49d9f..5281b866b0a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d277e83ed3e0..8541b7ee231c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1507,6 +1508,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..282a34ecb22a
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,153 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ *
+ * Although the application could later re-affinitize to a
+ * housekeeping core and lose task isolation semantics, or other tasks
+ * could be forcibly scheduled onto this core to restart preemptive
+ * scheduling, etc., this initial test should catch 99% of bugs with
+ * task placement prior to enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id()))
+			return -EINVAL;
+		if (!can_stop_my_full_tick_now())
+			return -EAGAIN;
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach here would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index aa9bf00749c1..53e4e62f2778 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,9 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/* Disable task isolation when delivering a signal. */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index cf8ba545c7d3..6d5b87273fcc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2269,6 +2270,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 084b79f5917e..04e77e562ea1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -207,6 +208,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -408,30 +414,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (3 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-18 13:35   ` Peter Zijlstra
  2016-04-05 17:38 ` [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 init/Kconfig       | 10 ++++++++++
 kernel/isolation.c |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 767f37bc3391..b2717e505157 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -805,6 +805,16 @@ config TASK_ISOLATION
 	 You should say "N" unless you are intending to run a
 	 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+	bool "Provide task isolation on all CPUs by default (except CPU 0)"
+	depends on TASK_ISOLATION
+	help
+	 If the user doesn't pass the task_isolation boot option to
+	 define the range of task isolation CPUs, then all CPUs in
+	 the system are task isolation CPUs by default.
+	 Note the boot CPU will still be kept outside the range to
+	 handle timekeeping duty, etc.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 282a34ecb22a..b364182dd8e2 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -40,8 +40,14 @@ int __init task_isolation_init(void)
 {
 	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
 	if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+		alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		cpumask_copy(task_isolation_map, cpu_possible_mask);
+		cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
 		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
 		return 0;
+#endif
 	}
 
 	/*
-- 
2.7.2


* [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (4 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-18 13:44   ` Peter Zijlstra
  2016-04-05 17:38 ` [PATCH v12 07/13] task_isolation: add debug boot flag Chris Metcalf
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry generates a signal.  For system
calls, this test is performed just before the SECCOMP test
and causes the syscall to fail immediately with ENOSYS.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, the syscall-checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group so that the task can exit without
a pointless signal being delivered on the way out.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/isolation.h  | 10 +++++++
 include/uapi/linux/prctl.h |  3 ++
 kernel/isolation.c         | 73 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 99b909462e64..eb78175ed811 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -36,6 +36,14 @@ static inline void task_isolation_set_flags(struct task_struct *p,
 		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
 }
 
+extern int task_isolation_syscall(int nr);
+extern void _task_isolation_exception(const char *fmt, ...);
+#define task_isolation_exception(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -43,6 +51,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 extern inline void task_isolation_set_flags(struct task_struct *p,
 					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index b364182dd8e2..f44e90109472 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,8 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -157,3 +159,74 @@ void task_isolation_enter(void)
 	if (!tick_nohz_tick_stopped())
 		set_tsk_need_resched(current);
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+	siginfo_t info = {};
+	int sig;
+
+	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+		task->comm, task->pid, buf);
+
+	/* Get the signal number to use. */
+	sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+	info.si_signo = sig;
+
+	/*
+	 * Turn off task isolation mode entirely to avoid spamming
+	 * the process with signals.  It can re-enable task isolation
+	 * mode in the signal handler if it wants to.
+	 */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	if (task->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+		va_list args;
+		char buf[100];
+
+		va_start(args, fmt);
+		vsnprintf(buf, sizeof(buf), fmt, args);
+		va_end(args);
+
+		task_isolation_interrupt(task, buf);
+	}
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and in STRICT mode prevents most syscalls from executing
+ * and raises a signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if ((task->task_isolation_flags & PR_TASK_ISOLATION_STRICT) &&
+	    syscall != __NR_prctl &&
+	    syscall != __NR_exit && syscall != __NR_exit_group) {
+		char buf[20];
+
+		snprintf(buf, sizeof(buf), "syscall %d", syscall);
+		task_isolation_interrupt(task, buf);
+
+		syscall_set_return_value(task, current_pt_regs(), -ENOSYS, -1);
+		return -1;
+	}
+
+	return 0;
+}
-- 
2.7.2


* [PATCH v12 07/13] task_isolation: add debug boot flag
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (5 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-18 13:56   ` Peter Zijlstra
  2016-04-05 17:38 ` [PATCH v12 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag Chris Metcalf
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel, and if they do, we either notify the
process (if STRICT mode is set and the interrupt is not an NMI)
or emit a kernel stack dump on the console (otherwise).

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.
Additionally, delivering a signal to the process in STRICT mode
allows applications to report task isolation failures up into their
own application logging framework.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt    |  8 ++++
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h              |  5 +++
 kernel/irq_work.c                      |  5 ++-
 kernel/isolation.c                     | 77 ++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    | 18 ++++++++
 kernel/signal.c                        |  4 ++
 kernel/smp.c                           |  6 ++-
 kernel/softirq.c                       | 33 +++++++++++++++
 9 files changed, 160 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9bd5e91357b1..7884e69d08fa 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3816,6 +3816,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+	return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index eb78175ed811..f04252c51cf1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -44,6 +44,9 @@ extern void _task_isolation_exception(const char *fmt, ...);
 			_task_isolation_exception(fmt, ## __VA_ARGS__); \
 	} while (0)
 
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -53,6 +56,8 @@ extern inline void task_isolation_set_flags(struct task_struct *p,
 					    unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index f44e90109472..1c4f320a24a0 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include <asm/syscall.h>
 #include "time/tick-sched.h"
@@ -230,3 +231,79 @@ int task_isolation_syscall(int syscall)
 
 	return 0;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static bool task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * Ensure the task is actually in userspace; if it is in kernel
+	 * mode, it is expected that it may receive interrupts, and in
+	 * any case they don't affect the isolation.  Note that there
+	 * is a race condition here as a task may have committed
+	 * to returning to user space but not yet set the context
+	 * tracking state to reflect it, and the check here is before
+	 * we trigger the interrupt, so we might fail to warn about a
+	 * legitimate interrupt.  However, the race window is narrow
+	 * and hitting it does not cause any incorrect behavior other
+	 * than failing to send the warning.
+	 */
+	if (!context_tracking_cpu_in_user(cpu))
+		return;
+
+	/*
+	 * If the task was in strict mode, deliver a signal to it.
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (p->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_interrupt(p, "interrupt");
+	}
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("Interrupt detected for task_isolation cpu %d, %s/%d\n",
+		       cpu, p->comm, p->pid);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask)
+{
+	int cpu, thiscpu = get_cpu();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu(cpu, mask)
+		if (cpu != thiscpu)
+			task_isolation_debug(cpu);
+
+	put_cpu();
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8465eeab8b3..00649f7ad567 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -605,6 +606,23 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+	struct task_struct *p;
+
+	if (!task_isolation_possible(cpu))
+		return;
+
+	rcu_read_lock();
+	p = cpu_curr(cpu);
+	get_task_struct(p);
+	rcu_read_unlock();
+	task_isolation_debug_task(cpu, p);
+	put_task_struct(p);
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 53e4e62f2778..9c0be099fcd9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -639,6 +639,10 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		task_isolation_set_flags(t, 0);
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 74165443c240..586a1309053b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -177,8 +178,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -456,6 +459,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..a96da9825582 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id());
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.7.2


* [PATCH v12 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (6 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 07/13] task_isolation: add debug boot flag Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

When this flag is set by the initial prctl(), the semantics of task
isolation change to be "one-shot", i.e. as soon as the kernel is
re-entered for any reason, task isolation is turned off.

During application development, use of this flag is best coupled with
STRICT mode, since otherwise any bug (e.g. an munmap from another
thread in the same task causing an IPI TLB flush) could cause the
task to fall out of task isolation mode without being aware of it.

In production it is typically still best to use STRICT mode, with
a signal handler that will report violations of task isolation
up to the application layer.  However, if you are confident the
application will never fall out of task isolation mode
unintentionally, you may wish to use ONE_SHOT mode to switch from
userspace task isolation to using the kernel freely, without the
small extra penalty of invoking prctl() explicitly to turn task
isolation off before starting to use kernel services.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/uapi/linux/prctl.h | 1 +
 kernel/isolation.c         | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5582ace987f..1e204f1a0f4a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,6 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
 # define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_ONE_SHOT	(1 << 2)
 # define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
 # define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 1c4f320a24a0..d0e94505bfac 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -205,7 +205,11 @@ void _task_isolation_exception(const char *fmt, ...)
 		va_end(args);
 
 		task_isolation_interrupt(task, buf);
+		return;
 	}
+
+	if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+		task_isolation_set_flags(task, 0);
 }
 
 /*
@@ -229,6 +233,9 @@ int task_isolation_syscall(int syscall)
 		return -1;
 	}
 
+	if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+		task_isolation_set_flags(task, 0);
+
 	return 0;
 }
 
-- 
2.7.2


* [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (7 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-07 16:58   ` Daniel Lezcano
  2016-04-05 17:38 ` [PATCH v12 10/13] arch/x86: enable task isolation functionality Chris Metcalf
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 5152b3898155..f2bcad21ecd3 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -306,6 +306,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.7.2


* [PATCH v12 10/13] arch/x86: enable task isolation functionality
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (8 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-18 16:23   ` Peter Zijlstra
  2016-04-05 17:38 ` [PATCH v12 11/13] arch/tile: " Chris Metcalf
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
and after we've handled the other work, call task_isolation_enter()
for such tasks.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 18 +++++++++++++++++-
 arch/x86/include/asm/thread_info.h |  2 ++
 arch/x86/kernel/traps.c            |  2 ++
 arch/x86/mm/fault.c                |  2 ++
 5 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2dc18605831f..760401ba3df0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -89,6 +89,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY		if X86_64
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_BPF_JIT			if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index e79d93d44ecd..31dfe4ff8915 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -87,6 +88,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 
 	work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp first -- it should minimize exposure of other
@@ -202,7 +210,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -236,11 +244,19 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((cached_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			cached_flags &= ~_TIF_TASK_ISOLATION;
+
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 82866697fcf1..057176ae597f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -97,6 +97,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_TASK_ISOLATION	13	/* task isolation enabled for task */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -121,6 +122,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 06cbe25861f1..b02205085571 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -382,6 +383,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5ce1ed02f7e8..025e9d2850c1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_exception	*/
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1259,6 +1260,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v12 11/13] arch/tile: enable task isolation functionality
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (9 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 10/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 12/13] arm64: factor work_pending state machine to C Chris Metcalf
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/Kconfig                   |  1 +
 arch/tile/include/asm/thread_info.h |  4 +++-
 arch/tile/kernel/process.c          |  9 +++++++++
 arch/tile/kernel/ptrace.c           |  7 +++++++
 arch/tile/kernel/single_step.c      |  5 +++++
 arch/tile/kernel/smp.c              | 28 ++++++++++++++++------------
 arch/tile/kernel/unaligned.c        |  3 +++
 arch/tile/mm/fault.c                |  3 +++
 arch/tile/mm/homecache.c            |  2 ++
 9 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 81719302b056..322ba76c8631 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -33,6 +33,7 @@ config TILE
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 
 # FIXME: investigate whether we need/want these options.
 #	select HAVE_IOREMAP_PROT
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..ea7fbd0d879d 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_NOHZ		11	/* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION	12	/* in task isolation mode */
 
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
@@ -139,11 +140,12 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION	(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
 	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_TASK_ISOLATION)
 
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..e752f6c2b645 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -495,9 +496,17 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		if (thread_info_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_info_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_info_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_info_flags & _TIF_WORK_MASK);
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..475a362415fb 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -255,6 +256,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+			return -1;
+	}
+
 	if (secure_computing() == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..3baa7230149e 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,8 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	task_isolation_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +770,8 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	task_isolation_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..da1eb240fc57 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -67,6 +68,7 @@ void send_IPI_single(int cpu, int tag)
 		.x = cpu % smp_width,
 		.state = HV_TO_BE_SENT
 	};
+	task_isolation_debug(cpu);
 	__send_IPI_many(&recip, 1, tag);
 }
 
@@ -84,6 +86,7 @@ void send_IPI_many(const struct cpumask *mask, int tag)
 		r->x = cpu % smp_width;
 		r->state = HV_TO_BE_SENT;
 	}
+	task_isolation_debug_cpumask(mask);
 	__send_IPI_many(recip, nrecip, tag);
 }
 
@@ -181,10 +184,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(task_isolation_map);
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +262,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +275,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu);
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 0db5f7c9d9e5..754204987532 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	task_isolation_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..7f910e4061f1 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -844,6 +845,8 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
+	task_isolation_exception("page fault interrupt %d at %#lx (%#lx)",
+				       fault_num, regs->pc, address);
 	__do_page_fault(regs, fault_num, address, write);
 }
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..e044e8dd8372 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask);
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v12 12/13] arm64: factor work_pending state machine to C
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (10 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 11/13] arch/tile: " Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-07-08 15:43   ` [PATCH]: " Chris Metcalf
  2016-04-05 17:38 ` [PATCH v12 13/13] arch/arm64: enable task isolation functionality Chris Metcalf
  2016-05-12 18:26 ` [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
  13 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.

This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will most likely
enable IRQs) with a call to trace_hardirqs_on just before the return
to userspace.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/kernel/entry.S  | 12 ++++--------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 12e8d2bcb3f9..d70a9e44b7d6 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -674,18 +674,13 @@ ret_fast_syscall_trace:
  * Ok, we need to do extra processing, enter the slow path.
  */
 work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
 	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bl	trace_hardirqs_off		// the IRQs are off here, inform the tracing code
+	bl	trace_hardirqs_on		// enabled while in userspace
 #endif
-	bl	schedule
-
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
+	b	finish_ret_to_user
 /*
  * "slow" syscall return path.
  */
@@ -694,6 +689,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+finish_ret_to_user:
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8eafdbc7cb8..404dd67080b9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,31 @@ static void do_signal(struct pt_regs *regs)
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+	/*
+	 * The assembly code enters us with IRQs off, but it hasn't
+	 * informed the tracing code of that for efficiency reasons.
+	 * Update the trace code with the current status.
+	 */
+	trace_hardirqs_off();
+	do {
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+		} else {
+			local_irq_enable();
+
+			if (thread_flags & _TIF_SIGPENDING)
+				do_signal(regs);
+
+			if (thread_flags & _TIF_NOTIFY_RESUME) {
+				clear_thread_flag(TIF_NOTIFY_RESUME);
+				tracehook_notify_resume(regs);
+			}
+
+			if (thread_flags & _TIF_FOREIGN_FPSTATE)
+				fpsimd_restore_current_state();
+		}
 
+		local_irq_disable();
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+	} while (thread_flags & _TIF_WORK_MASK);
 }
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v12 13/13] arch/arm64: enable task isolation functionality
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (11 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 12/13] arm64: factor work_pending state machine to C Chris Metcalf
@ 2016-04-05 17:38 ` Chris Metcalf
  2016-05-12 18:26 ` [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-04-05 17:38 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

In do_notify_resume(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags;
and after we've handled the other work, call task_isolation_enter()
for such tasks.  To ensure we always call task_isolation_enter() when
returning to userspace, add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
while leaving the old bitmask value as _TIF_WORK_LOOP_MASK to
check while looping.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_cross_call() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 15 ++++++++++++---
 arch/arm64/kernel/signal.c           | 10 ++++++++++
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  4 ++++
 6 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 4f436220384f..ec033abee9d5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -57,6 +57,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_BPF_JIT
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..bdc6426b9968 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -109,6 +109,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
+#define TIF_TASK_ISOLATION	4
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9
@@ -124,6 +125,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -132,7 +134,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
+				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 3f6cd5c5234f..ae336065733d 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1246,14 +1247,22 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	/* Do the secure computing check first; failures should be fast. */
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 404dd67080b9..f9b9b25636ca 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -424,9 +425,18 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
 				fpsimd_restore_current_state();
+
+			if (thread_flags & _TIF_TASK_ISOLATION)
+				task_isolation_enter();
 		}
 
 		local_irq_disable();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
+
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_flags & _TIF_WORK_MASK);
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b2d5f4ee9a1c..83ed6b5baa4d 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -710,6 +711,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
 {
 	trace_ipi_raise(target, ipi_types[ipinr]);
+	task_isolation_debug_cpumask(target);
 	__smp_cross_call(target, ipinr);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 95df28bc875f..77f827b02c6d 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -482,6 +483,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
 	const struct fault_info *inf = fault_info + (esr & 63);
 	struct siginfo info;
 
+	if (user_mode(regs))
+		task_isolation_exception("%s at %#lx", inf->name, addr);
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state
  2016-04-05 17:38 ` [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2016-04-07 16:58   ` Daniel Lezcano
  0 siblings, 0 replies; 39+ messages in thread
From: Daniel Lezcano @ 2016-04-07 16:58 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel

On Tue, Apr 05, 2016 at 01:38:38PM -0400, Chris Metcalf wrote:
> When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
> we call hrtimer_cancel(), which eventually calls down into
> __remove_hrtimer() and thus into hrtimer_force_reprogram().
> That function's call to tick_program_event() detects that
> we are trying to set the expiration to KTIME_MAX and calls
> clockevents_switch_state() to set the state to ONESHOT_STOPPED,
> and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
> CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.
> 
> However, by default the internal __clockevents_switch_state() code
> doesn't have a "set_state_oneshot_stopped" function pointer for
> the arm_arch_timer or tile clock_event_device structures, so that
> code returns -ENOSYS, and we end up not setting the state, and more
> importantly, we don't actually turn off the hardware timer.
> As a result, the timer tick we were waiting for before is still
> queued, and fires shortly afterwards, only to discover there was
> nothing for it to do, at which point it quiesces.
> 
> The fix is to provide that function pointer field, and like the
> other function pointers, have it just turn off the timer interrupt.
> Any call to set a new timer interval will properly re-enable it.
> 
> This fix avoids a small performance hiccup for regular applications,
> but for TASK_ISOLATION code, it fixes a potentially serious
> kernel timer interruption to the time-sensitive application.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 00/13] support "task_isolation" mode
  2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
                   ` (12 preceding siblings ...)
  2016-04-05 17:38 ` [PATCH v12 13/13] arch/arm64: enable task isolation functionality Chris Metcalf
@ 2016-05-12 18:26 ` Chris Metcalf
  13 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-12 18:26 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

Ping, since the 4.7 merge window is opening soon and I haven't received
too much feedback on this version of the patch series based on 4.6-rc1.

1. Patch 09/13 for timer ticks was acked by Daniel Lezcano and is standalone,
    so could be taken into the relevant trees.  I'm not sure if it should go in
    as two separate patches through the tile and arm architecture trees,
    or through the timer tree as a combined patch.  Catalin/Will, any ideas?

    http://lkml.kernel.org/g/1459877922-15512-10-git-send-email-cmetcalf@mellanox.com

2. Patch 12/13, factoring the work_pending state machine for arm64 into C,
    should probably get an Ack from Mark Rutland and then go via the
    arm64 tree:

    http://lkml.kernel.org/g/1459877922-15512-13-git-send-email-cmetcalf@mellanox.com

3. Frederic provided some more feedback and I think we are still waiting
    to close the loop on the notion of how strict we should be by default:

    http://lkml.kernel.org/g/571E7FC9.60405@mellanox.com

We have been flogging this patch series along for just over a year now;
v1 of the patch series was sent on May 8, 2015.  Phew!

On 4/5/2016 1:38 PM, Chris Metcalf wrote:
> Here is a respin of the task-isolation patch set.  The previous one
> came out just before the merge window for 4.6 opened, so I suspect
> folks may have been busy merging, since it got few comments.
>
> Frederic, how are you feeling about taking this all via your tree?
> And what is your take on the new PR_TASK_ISOLATION_ONE_SHOT mode?
> I'm not sure what the right path to upstream for this series is.
>
> Changes since v11:
>
> - Rebased on v4.6-rc1.  This required me to create a
>    can_stop_my_full_tick() helper in tick-sched.c, since the underlying
>    can_stop_full_tick() now takes a struct tick_sched.
>
> - Added a HAVE_ARCH_TASK_ISOLATION Kconfig flag so that you can't
>    try to build with TASK_ISOLATION enabled for an architecture until
>    it is explicitly configured to work.  This avoids possible
>    allyesconfig build failures for unsupported architectures, or even
>    for supported ones when bisecting to the middle of this series.
>
> - Return EAGAIN instead of EINVAL for the enabling prctl() if the task
>    is affinitized to a task-isolation core, but things just aren't yet
>    right for it (e.g. another task running).  This lets the caller
>    differentiate a potentially transient failure from a permanent
>    failure, for which we still return EINVAL.
>
> The previous (v11) patch series is here:
>
> https://lkml.kernel.org/r/1457734223-26209-1-git-send-email-cmetcalf@mellanox.com
>
> This version of the patch series has been tested on arm64 and tile,
> and build-tested on x86.
>
> It remains true that the 1 Hz tick needs to be disabled for this
> patch series to be able to achieve its primary goal of enabling
> truly tick-free operation, but that is ongoing orthogonal work.
>
> The series is available at:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
>
> Chris Metcalf (13):
>    vmstat: add quiet_vmstat_sync function
>    vmstat: add vmstat_idle function
>    lru_add_drain_all: factor out lru_add_drain_needed
>    task_isolation: add initial support
>    task_isolation: support CONFIG_TASK_ISOLATION_ALL
>    task_isolation: support PR_TASK_ISOLATION_STRICT mode
>    task_isolation: add debug boot flag
>    task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
>    arm, tile: turn off timer tick for oneshot_stopped state
>    arch/x86: enable task isolation functionality
>    arch/tile: enable task isolation functionality
>    arm64: factor work_pending state machine to C
>    arch/arm64: enable task isolation functionality
>
>   Documentation/kernel-parameters.txt    |  16 ++
>   arch/arm64/Kconfig                     |   1 +
>   arch/arm64/include/asm/thread_info.h   |   5 +-
>   arch/arm64/kernel/entry.S              |  12 +-
>   arch/arm64/kernel/ptrace.c             |  15 +-
>   arch/arm64/kernel/signal.c             |  42 ++++-
>   arch/arm64/kernel/smp.c                |   2 +
>   arch/arm64/mm/fault.c                  |   4 +
>   arch/tile/Kconfig                      |   1 +
>   arch/tile/include/asm/thread_info.h    |   4 +-
>   arch/tile/kernel/process.c             |   9 +
>   arch/tile/kernel/ptrace.c              |   7 +
>   arch/tile/kernel/single_step.c         |   5 +
>   arch/tile/kernel/smp.c                 |  28 +--
>   arch/tile/kernel/time.c                |   1 +
>   arch/tile/kernel/unaligned.c           |   3 +
>   arch/tile/mm/fault.c                   |   3 +
>   arch/tile/mm/homecache.c               |   2 +
>   arch/x86/Kconfig                       |   1 +
>   arch/x86/entry/common.c                |  18 +-
>   arch/x86/include/asm/thread_info.h     |   2 +
>   arch/x86/kernel/traps.c                |   2 +
>   arch/x86/mm/fault.c                    |   2 +
>   drivers/base/cpu.c                     |  18 ++
>   drivers/clocksource/arm_arch_timer.c   |   2 +
>   include/linux/context_tracking_state.h |   6 +
>   include/linux/isolation.h              |  63 +++++++
>   include/linux/sched.h                  |   3 +
>   include/linux/swap.h                   |   1 +
>   include/linux/tick.h                   |   2 +
>   include/linux/vmstat.h                 |   4 +
>   include/uapi/linux/prctl.h             |   9 +
>   init/Kconfig                           |  33 ++++
>   kernel/Makefile                        |   1 +
>   kernel/fork.c                          |   3 +
>   kernel/irq_work.c                      |   5 +-
>   kernel/isolation.c                     | 316 +++++++++++++++++++++++++++++++++
>   kernel/sched/core.c                    |  18 ++
>   kernel/signal.c                        |   8 +
>   kernel/smp.c                           |   6 +-
>   kernel/softirq.c                       |  33 ++++
>   kernel/sys.c                           |   9 +
>   kernel/time/tick-sched.c               |  36 ++--
>   mm/swap.c                              |  15 +-
>   mm/vmstat.c                            |  21 +++
>   45 files changed, 743 insertions(+), 54 deletions(-)
>   create mode 100644 include/linux/isolation.h
>   create mode 100644 kernel/isolation.c
>

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 04/13] task_isolation: add initial support
  2016-04-05 17:38 ` [PATCH v12 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-05-18 13:34   ` Peter Zijlstra
  2016-05-18 16:34     ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 13:34 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Apr 05, 2016 at 01:38:33PM -0400, Chris Metcalf wrote:
> diff --git a/kernel/signal.c b/kernel/signal.c
> index aa9bf00749c1..53e4e62f2778 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -34,6 +34,7 @@
>  #include <linux/compat.h>
>  #include <linux/cn_proc.h>
>  #include <linux/compiler.h>
> +#include <linux/isolation.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -2213,6 +2214,9 @@ relock:
>  		/* Trace actually delivered signals. */
>  		trace_signal_deliver(signr, &ksig->info, ka);
>  
> +		/* Disable task isolation when delivering a signal. */

Why !? Changelog is quiet on this.

> +		task_isolation_set_flags(current, 0);
> +
>  		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
>  			continue;
>  		if (ka->sa.sa_handler != SIG_DFL) {

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-04-05 17:38 ` [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
@ 2016-05-18 13:35   ` Peter Zijlstra
  2016-05-18 16:34     ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 13:35 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Tue, Apr 05, 2016 at 01:38:34PM -0400, Chris Metcalf wrote:
> This option, similar to NO_HZ_FULL_ALL, simplifies configuring
> a system to boot by default with all cores except the boot core
> running in task isolation mode.

Hurm, we still have that option? I thought we killed it, because random
people set it and 'complain' their system misbehaves.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-04-05 17:38 ` [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
@ 2016-05-18 13:44   ` Peter Zijlstra
  2016-05-18 16:34     ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 13:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, Apr 05, 2016 at 01:38:35PM -0400, Chris Metcalf wrote:
> +void task_isolation_interrupt(struct task_struct *task, const char *buf)
> +{
> +	siginfo_t info = {};
> +	int sig;
> +
> +	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
> +		task->comm, task->pid, buf);
> +

So the function name suggests this is called for interrupts, except its
purpose is to deliver a signal.

Now, in case of exceptions the violation isn't necessarily _by_ the task
itself. You might want to change that to report the exception
type/number instead of the affected task.

> +	/* Get the signal number to use. */
> +	sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
> +	if (sig == 0)
> +		sig = SIGKILL;
> +	info.si_signo = sig;
> +
> +	/*
> +	 * Turn off task isolation mode entirely to avoid spamming
> +	 * the process with signals.  It can re-enable task isolation
> +	 * mode in the signal handler if it wants to.
> +	 */
> +	task_isolation_set_flags(task, 0);
> +
> +	send_sig_info(sig, &info, task);
> +}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
  2016-04-05 17:38 ` [PATCH v12 07/13] task_isolation: add debug boot flag Chris Metcalf
@ 2016-05-18 13:56   ` Peter Zijlstra
  2016-05-18 16:36     ` Chris Metcalf
       [not found]     ` <684587d7-3653-7570-215f-37d3e9e786bc@mellanox.com>
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 13:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:
> +#ifdef CONFIG_TASK_ISOLATION
> +void task_isolation_debug(int cpu)
> +{
> +	struct task_struct *p;
> +
> +	if (!task_isolation_possible(cpu))
> +		return;
> +
> +	rcu_read_lock();
> +	p = cpu_curr(cpu);
> +	get_task_struct(p);
> +	rcu_read_unlock();
> +	task_isolation_debug_task(cpu, p);
> +	put_task_struct(p);


This is still broken...

Also, I really don't like how you sprinkle a call all over the core
kernel. At the very least make an inline fast path for this function to
avoid the call whenever possible.

> +}
> +#endif

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 10/13] arch/x86: enable task isolation functionality
  2016-04-05 17:38 ` [PATCH v12 10/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-05-18 16:23   ` Peter Zijlstra
  2016-05-18 16:35     ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 16:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, H. Peter Anvin,
	x86, linux-kernel

On Tue, Apr 05, 2016 at 01:38:39PM -0400, Chris Metcalf wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 06cbe25861f1..b02205085571 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -36,6 +36,7 @@
>  #include <linux/mm.h>
>  #include <linux/smp.h>
>  #include <linux/io.h>
> +#include <linux/isolation.h>
>  
>  #ifdef CONFIG_EISA
>  #include <linux/ioport.h>
> @@ -382,6 +383,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>  	case 2:	/* Bound directory has invalid entry. */
>  		if (mpx_handle_bd_fault())
>  			goto exit_trap;
> +		task_isolation_exception("bounds check");
>  		break; /* Success, it was handled */
>  	case 1: /* Bound violation. */
>  		info = mpx_generate_siginfo(regs);
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 5ce1ed02f7e8..025e9d2850c1 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -14,6 +14,7 @@
>  #include <linux/prefetch.h>		/* prefetchw			*/
>  #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
>  #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
> +#include <linux/isolation.h>		/* task_isolation_exception	*/
>  
>  #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
>  #include <asm/traps.h>			/* dotraplinkage, ...		*/
> @@ -1259,6 +1260,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
>  		local_irq_enable();
>  		error_code |= PF_USER;
>  		flags |= FAULT_FLAG_USER;
> +		task_isolation_exception("page fault at %#lx", address);
>  	} else {
>  		if (regs->flags & X86_EFLAGS_IF)
>  			local_irq_enable();


That seems to miss a whole bunch of exceptions... what up?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 04/13] task_isolation: add initial support
  2016-05-18 13:34   ` Peter Zijlstra
@ 2016-05-18 16:34     ` Chris Metcalf
  2016-05-18 17:09       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-05-18 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 5/18/2016 9:34 AM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 01:38:33PM -0400, Chris Metcalf wrote:
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index aa9bf00749c1..53e4e62f2778 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -34,6 +34,7 @@
>>   #include <linux/compat.h>
>>   #include <linux/cn_proc.h>
>>   #include <linux/compiler.h>
>> +#include <linux/isolation.h>
>>   
>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/signal.h>
>> @@ -2213,6 +2214,9 @@ relock:
>>   		/* Trace actually delivered signals. */
>>   		trace_signal_deliver(signr, &ksig->info, ka);
>>   
>> +		/* Disable task isolation when delivering a signal. */
> Why !? Changelog is quiet on this.

There are really two reasons.

1. If the task is receiving a signal, it will know it's not isolated
    any more, so we don't need to worry about notifying it explicitly.
    This behavior is easy to document and allows the application to decide
    if the signal is unexpected and it should go straight to its error
    handling path (likely outcome, and in that case you want task isolation
    off anyway) or if it thinks it can plausibly re-enable isolation and
    return to where the signal interrupted it (hard to imagine how this
    would ever make sense, but you could if you wanted to).

2. When we are delivering a signal we may already be holding the lock
    for the signal subsystem, and it gets hard to figure out whether it's
    safe to send another signal to the application as a "task isolation
    broken" notification.  For example, sending a signal to a task on
    another core involves doing an IPI to that core to kick it; the IPI
    normally is a generic point for notifying the remote core of broken
    task isolation and sending a signal - except that at the point where
    we would do that on the signal path we are already holding the lock,
    so we end up deadlocked.  We could no doubt work around that, but it
    seemed cleaner to decouple the existing signal mechanism from the
    signal delivery for task isolation.

I will add more discussion of the rationale to the commit message.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-05-18 13:35   ` Peter Zijlstra
@ 2016-05-18 16:34     ` Chris Metcalf
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-18 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On 5/18/2016 9:35 AM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 01:38:34PM -0400, Chris Metcalf wrote:
>> This option, similar to NO_HZ_FULL_ALL, simplifies configuring
>> a system to boot by default with all cores except the boot core
>> running in task isolation mode.
> Hurm, we still have that option? I thought we killed it, because random
> people set it and 'complain' their system misbehaves.

It's still in, as of 4.6 (and still in linux-next too).  I did receive
feedback saying the option is useful when setting up a kernel to run
isolation apps on systems that may have a varying number of processors,
since it means you don't need to tweak the boot arguments each time.

A different approach that I'd be happy to pursue would be to provide
a "clipping" version of cpulist_parse() that allows you to pass a boot
argument like "nohz_full=1-1000" and just clip off the impossible cpus.
We could then change "nohz_full=" and "task_isolation=" to use it.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-05-18 13:44   ` Peter Zijlstra
@ 2016-05-18 16:34     ` Chris Metcalf
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-18 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 5/18/2016 9:44 AM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 01:38:35PM -0400, Chris Metcalf wrote:
>> +void task_isolation_interrupt(struct task_struct *task, const char *buf)
>> +{
>> +	siginfo_t info = {};
>> +	int sig;
>> +
>> +	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
>> +		task->comm, task->pid, buf);
>> +
> So the function name suggests this is called for interrupts, except its
> purpose is to deliver a signal.

Fair point.  I'll name it task_isolation_deliver_signal() in the next patch series.

> Now, in case of exceptions the violation isn't necessarily _by_ the task
> itself. You might want to change that to report the exception
> type/number instead of the affected task.

Well, we do report whatever exception information we have.  For example
a page fault exception will report the address or whatever other info is
handy; it's easy to tune since it's just a vsnprintf of some varargs from the
architecture layer.

For things like IPIs or TLB invalidations or whatever, the code currently just
reports "interrupt"; I could arrange to pass down more informative varargs
from the caller for that as well.  Let me look into it.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 10/13] arch/x86: enable task isolation functionality
  2016-05-18 16:23   ` Peter Zijlstra
@ 2016-05-18 16:35     ` Chris Metcalf
  2016-05-18 17:10       ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-05-18 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, H. Peter Anvin,
	x86, linux-kernel

On 5/18/2016 12:23 PM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 01:38:39PM -0400, Chris Metcalf wrote:
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 06cbe25861f1..b02205085571 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -36,6 +36,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/smp.h>
>>   #include <linux/io.h>
>> +#include <linux/isolation.h>
>>   
>>   #ifdef CONFIG_EISA
>>   #include <linux/ioport.h>
>> @@ -382,6 +383,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
>>   	case 2:	/* Bound directory has invalid entry. */
>>   		if (mpx_handle_bd_fault())
>>   			goto exit_trap;
>> +		task_isolation_exception("bounds check");
>>   		break; /* Success, it was handled */
>>   	case 1: /* Bound violation. */
>>   		info = mpx_generate_siginfo(regs);
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 5ce1ed02f7e8..025e9d2850c1 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/prefetch.h>		/* prefetchw			*/
>>   #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
>>   #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
>> +#include <linux/isolation.h>		/* task_isolation_exception	*/
>>   
>>   #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
>>   #include <asm/traps.h>			/* dotraplinkage, ...		*/
>> @@ -1259,6 +1260,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
>>   		local_irq_enable();
>>   		error_code |= PF_USER;
>>   		flags |= FAULT_FLAG_USER;
>> +		task_isolation_exception("page fault at %#lx", address);
>>   	} else {
>>   		if (regs->flags & X86_EFLAGS_IF)
>>   			local_irq_enable();
>
> That seems to miss a whole bunch of exceptions... what up?

We only need to do an explicit call in the case where the exception does NOT
result in a signal, since a signal is something that will be really obvious to
the application in any case.  So it's just stuff like handled page faults,
handled bounds checks for x86, handled unaligned load/store for tilegx, etc.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
  2016-05-18 13:56   ` Peter Zijlstra
@ 2016-05-18 16:36     ` Chris Metcalf
       [not found]     ` <684587d7-3653-7570-215f-37d3e9e786bc@mellanox.com>
  1 sibling, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-18 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

(Oops, missed one that I should have forced to text/plain. Resending.)

On 5/18/2016 9:56 AM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:
>> +#ifdef CONFIG_TASK_ISOLATION
>> +void task_isolation_debug(int cpu)
>> +{
>> +	struct task_struct *p;
>> +
>> +	if (!task_isolation_possible(cpu))
>> +		return;
>> +
>> +	rcu_read_lock();
>> +	p = cpu_curr(cpu);
>> +	get_task_struct(p);
>> +	rcu_read_unlock();
>> +	task_isolation_debug_task(cpu, p);
>> +	put_task_struct(p);
> This is still broken...

I don't know how or why, though. :-)  Can you give me a better idiom?
This looks to my eye just like how it's done for something like
sched_setaffinity() by one task on another task, and I would have
assumed the risks there of the other task evaporating part way
through would be the same as the risks here.

> Also, I really don't like how you sprinkle a call all over the core
> kernel. At the very least make an inline fast path for this function to
> avoid the call whenever possible.

I can boost the "task_isolation_possible()" test up into a static inline,
and only call in the case where we have a target cpu that is actually
in the "task_isolation=" boot argument set.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
       [not found]     ` <684587d7-3653-7570-215f-37d3e9e786bc@mellanox.com>
@ 2016-05-18 17:06       ` Peter Zijlstra
  2016-05-19 15:12         ` Chris Metcalf
       [not found]         ` <b22cf841-9555-fff3-435b-646c4fd95bdc@mellanox.com>
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:06 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Wed, May 18, 2016 at 12:35:19PM -0400, Chris Metcalf wrote:
> On 5/18/2016 9:56 AM, Peter Zijlstra wrote:
> >On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:
> >>+#ifdef CONFIG_TASK_ISOLATION
> >>+void task_isolation_debug(int cpu)
> >>+{
> >>+	struct task_struct *p;
> >>+
> >>+	if (!task_isolation_possible(cpu))
> >>+		return;
> >>+
> >>+	rcu_read_lock();
> >>+	p = cpu_curr(cpu);
> >>+	get_task_struct(p);
> >>+	rcu_read_unlock();
> >>+	task_isolation_debug_task(cpu, p);
> >>+	put_task_struct(p);
> >
> >This is still broken...
> 
> I don't know how or why, though. :-)  Can you give me a better idiom?
> This looks to my eye just like how it's done for something like
> sched_setaffinity() by one task on another task, and I would have
> assumed the risks there of the other task evaporating part way
> through would be the same as the risks here.

Because rcu_read_lock() does not stop the task pointed to by
cpu_curr(cpu) from disappearing on you entirely.

See also the discussion around:

lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 04/13] task_isolation: add initial support
  2016-05-18 16:34     ` Chris Metcalf
@ 2016-05-18 17:09       ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:09 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Wed, May 18, 2016 at 12:34:22PM -0400, Chris Metcalf wrote:
> On 5/18/2016 9:34 AM, Peter Zijlstra wrote:
> >On Tue, Apr 05, 2016 at 01:38:33PM -0400, Chris Metcalf wrote:
> >>diff --git a/kernel/signal.c b/kernel/signal.c
> >>index aa9bf00749c1..53e4e62f2778 100644
> >>--- a/kernel/signal.c
> >>+++ b/kernel/signal.c
> >>@@ -34,6 +34,7 @@
> >>  #include <linux/compat.h>
> >>  #include <linux/cn_proc.h>
> >>  #include <linux/compiler.h>
> >>+#include <linux/isolation.h>
> >>  #define CREATE_TRACE_POINTS
> >>  #include <trace/events/signal.h>
> >>@@ -2213,6 +2214,9 @@ relock:
> >>  		/* Trace actually delivered signals. */
> >>  		trace_signal_deliver(signr, &ksig->info, ka);
> >>+		/* Disable task isolation when delivering a signal. */
> >Why !? Changelog is quiet on this.
> 
> There are really two reasons.
> 
> 1. If the task is receiving a signal, it will know it's not isolated
>    any more, so we don't need to worry about notifying it explicitly.
>    This behavior is easy to document and allows the application to decide
>    if the signal is unexpected and it should go straight to its error
>    handling path (likely outcome, and in that case you want task isolation
>    off anyway) or if it thinks it can plausibly re-enable isolation and
>    return to where the signal interrupted it (hard to imagine how this
>    would ever make sense, but you could if you wanted to).
> 
> 2. When we are delivering a signal we may already be holding the lock
>    for the signal subsystem, and it gets hard to figure out whether it's
>    safe to send another signal to the application as a "task isolation
>    broken" notification.  For example, sending a signal to a task on
>    another core involves doing an IPI to that core to kick it; the IPI
>    normally is a generic point for notifying the remote core of broken
>    task isolation and sending a signal - except that at the point where
>    we would do that on the signal path we are already holding the lock,
>    so we end up deadlocked.  We could no doubt work around that, but it
>    seemed cleaner to decouple the existing signal mechanism from the
>    signal delivery for task isolation.
> 
> I will add more discussion of the rationale to the commit message.

Please also expand the in-code comment, as that is what we'll see first
when reading the code.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 10/13] arch/x86: enable task isolation functionality
  2016-05-18 16:35     ` Chris Metcalf
@ 2016-05-18 17:10       ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:10 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, H. Peter Anvin,
	x86, linux-kernel

On Wed, May 18, 2016 at 12:35:40PM -0400, Chris Metcalf wrote:
> On 5/18/2016 12:23 PM, Peter Zijlstra wrote:

> >
> >That seems to miss a whole bunch of exceptions... what up?
> 
> We only need to do an explicit call in the case where the exception does NOT
> result in a signal, since a signal is something that will be really obvious to
> the application in any case.  So it's just stuff like handled page faults,
> handled bounds checks for x86, handled unaligned load/store for tilegx, etc.

Right; as per your earlier signal explanation. This seems somewhat
fragile to maintain though and really could use more clarification.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
  2016-05-18 17:06       ` Peter Zijlstra
@ 2016-05-19 15:12         ` Chris Metcalf
       [not found]         ` <b22cf841-9555-fff3-435b-646c4fd95bdc@mellanox.com>
  1 sibling, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-19 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

(Resending in text/plain.  I just screwed around with my Thunderbird
config some more in hopes of getting it to pay attention to all the
settings that say "use plain text for LKML", but, we'll see.)

On 5/18/2016 1:06 PM, Peter Zijlstra wrote:
> On Wed, May 18, 2016 at 12:35:19PM -0400, Chris Metcalf wrote:
>> On 5/18/2016 9:56 AM, Peter Zijlstra wrote:
>>> On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:
>>>> +#ifdef CONFIG_TASK_ISOLATION
>>>> +void task_isolation_debug(int cpu)
>>>> +{
>>>> +	struct task_struct *p;
>>>> +
>>>> +	if (!task_isolation_possible(cpu))
>>>> +		return;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	p = cpu_curr(cpu);
>>>> +	get_task_struct(p);
>>>> +	rcu_read_unlock();
>>>> +	task_isolation_debug_task(cpu, p);
>>>> +	put_task_struct(p);
>>> This is still broken...
>> I don't know how or why, though. :-)  Can you give me a better idiom?
>> This looks to my eye just like how it's done for something like
>> sched_setaffinity() by one task on another task, and I would have
>> assumed the risks there of the other task evaporating part way
>> through would be the same as the risks here.
> Because rcu_read_lock() does not stop the task pointed to by
> cpu_curr(cpu) from disappearing on you entirely.

So clearly once we have a task struct with an incremented usage count,
we are golden: the task isolation code only touches immediate fields
of task_struct, which are guaranteed to stick around until we
put_task_struct(), and the other path is into send_sig_info(), which
is already robust to the task struct being exited (the ->sighand
becomes NULL and we bail out in __lock_task_sighand, otherwise we're
holding sighand->siglock until we deliver the signal).

So, I think what you're saying is that there is a race between when we
read per_cpu(runqueues, cpu).curr, and when we increment the
p->usage value in the task, and that the RCU read lock doesn't help
with that?  My impression was that by being the ".curr" task, we are
guaranteed that it hasn't gone through do_exit() yet, and thus we
benefit from an RCU guarantee around being able to validly dereference
the pointer, i.e. it hasn't yet been freed and so dereferencing is safe.

I don't see how grabbing the ->curr from the runqueue is any more
fragile from an RCU perspective than grabbing the task from the pid in
kill_pid_info().  And in fact, that code doesn't even bump
task->usage, as far as I can see, just relying on getting the
sighand->siglock.

Anyway, whatever more clarity you can offer me, or suggestions for
APIs to use are welcome.

> See also the discussion around:
>
> lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net

This makes me wonder if I should use rcu_dereference(&cpu_curr(cpu))
just for clarity, though I think it's just as correct either way.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
       [not found]         ` <b22cf841-9555-fff3-435b-646c4fd95bdc@mellanox.com>
@ 2016-05-19 17:54           ` Peter Zijlstra
  2016-05-19 18:05             ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2016-05-19 17:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Thu, May 19, 2016 at 10:42:39AM -0400, Chris Metcalf wrote:

> >>>>+	rcu_read_lock();
> >>>>+	p = cpu_curr(cpu);

Here @cpu can schedule, hit TASK_DEAD and do put_task_struct() and
kfree() the task.

> >>>>+	get_task_struct(p);

And here we then do a use-after-free.

> >>>>+	rcu_read_unlock();
> >>>>+	task_isolation_debug_task(cpu, p);
> >>>>+	put_task_struct(p);

> So, I think what you're saying is that there is a race between when we
> read per_cpu(runqueues, cpu).curr, and when we increment the
> p->usage value in the task, and that the RCU read lock doesn't help
> with that? 

Yep, as per the above.

> My impression was that by being the ".curr" task, we are
> guaranteed that it hasn't gone through do_exit() yet, and thus we
> benefit from an RCU guarantee around being able to validly dereference
> the pointer, i.e. it hasn't yet been freed and so dereferencing is safe.

Nope... the only way to avoid this from happening is taking @cpu's
rq->lock to prevent the remote CPU from scheduling.

> I don't see how grabbing the ->curr from the runqueue is any more
> fragile from an RCU perspective than grabbing the task from the pid in
> kill_pid_info().

The whole pid data structure is RCU managed, rq->curr is not.

> Anyway, whatever more clarity you can offer me, or suggestions for
> APIs to use are welcome.

The API proposed in the discussion below..

> >See also the discussion around:
> >
> >lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
> 
> This makes me wonder if I should use rcu_dereference(&cpu_curr(cpu))
> just for clarity, though I think it's just as correct either way.

Nope, that's just as broken.

So the 'simple' thing is:

	struct rq *rq = cpu_rq(cpu);
	struct task_struct *task;

	raw_spin_lock_irq(&rq->lock);
	task = rq->curr;
	get_task_struct(task);
	raw_spin_unlock_irq(&rq->lock);

Because by holding rq->lock, the remote CPU cannot schedule and the
current task _must_ still be valid.

And note; the above can result in a task which already has PF_EXITING
set.

The complex thing is described in the linked thread and will likely make
your head hurt.


* Re: [PATCH v12 07/13] task_isolation: add debug boot flag
  2016-05-19 17:54           ` Peter Zijlstra
@ 2016-05-19 18:05             ` Chris Metcalf
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-05-19 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On 5/19/2016 1:54 PM, Peter Zijlstra wrote:
> So the 'simple' thing is:
>
> 	struct rq *rq = cpu_rq(cpu);
> 	struct task_struct *task;
>
> 	raw_spin_lock_irq(&rq->lock);
> 	task = rq->curr;
> 	get_task_struct(task);
> 	raw_spin_unlock_irq(&rq->lock);
>
> Because by holding rq->lock, the remote CPU cannot schedule and the
> current task _must_ still be valid.

I will plan to use that idiom in the next patch series.  Thanks!

> And note; the above can result in a task which already has PF_EXITING
> set.

I think that should be benign though; we may generate an unnecessary
warning, but somebody was doing something that could have resulted in
interrupting an isolated task anyway, so warning about it is reasonable.  And
presumably PF_EXITING just means we don't wake any threads and leave
the signal queued, but that gets flushed when the task finally exits.

> The complex thing is described in the linked thread and will likely make
> your head hurt.

I read the linked thread and was entertained. :-)  I suspect locking the
runqueue may be the more robust solution anyway, and since this is
presumably not a hot path, it seems easier to reason about this way.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* [PATCH]: arm64: factor work_pending state machine to C
  2016-04-05 17:38 ` [PATCH v12 12/13] arm64: factor work_pending state machine to C Chris Metcalf
@ 2016-07-08 15:43   ` Chris Metcalf
  2016-07-08 15:49     ` Will Deacon
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-07-08 15:43 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel,
	linux-kernel

Ping!

I am hopeful that this patch [1] can be picked up for the 4.8 merge window
in the arm64 tree.  As I mentioned in my last patch series that included
this patch [2], I'm hopeful that this version addresses the performance
issues that were seen with Mark's original patch.  This version tests the
TIF flags prior to calling out to the loop in C code.

It makes more sense for it to go via the arm64 tree than to wait and go
through the nohz_full tree, I suspect.

Thanks!

[1] https://lkml.kernel.org/g/1459877922-15512-13-git-send-email-cmetcalf@mellanox.com
[2] https://lkml.kernel.org/g/56E9A3CE.2040209@mellanox.com

On 4/5/2016 1:38 PM, Chris Metcalf wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
>
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
>
> This patch tries to closely mirror the existing behaviour while using
> the usual C control flow primitives. As local_irq_{disable,enable} may
> be instrumented, we balance exception entry (where we will most
> likely enable IRQs) with a call to trace_hardirqs_on just before the
> return to userspace.
>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>   arch/arm64/kernel/entry.S  | 12 ++++--------
>   arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>   2 files changed, 30 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 12e8d2bcb3f9..d70a9e44b7d6 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -674,18 +674,13 @@ ret_fast_syscall_trace:
>    * Ok, we need to do extra processing, enter the slow path.
>    */
>   work_pending:
> -	tbnz	x1, #TIF_NEED_RESCHED, work_resched
> -	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
>   	mov	x0, sp				// 'regs'
> -	enable_irq				// enable interrupts for do_notify_resume()
>   	bl	do_notify_resume
> -	b	ret_to_user
> -work_resched:
>   #ifdef CONFIG_TRACE_IRQFLAGS
> -	bl	trace_hardirqs_off		// the IRQs are off here, inform the tracing code
> +	bl	trace_hardirqs_on		// enabled while in userspace
>   #endif
> -	bl	schedule
> -
> +	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
> +	b	finish_ret_to_user
>   /*
>    * "slow" syscall return path.
>    */
> @@ -694,6 +689,7 @@ ret_to_user:
>   	ldr	x1, [tsk, #TI_FLAGS]
>   	and	x2, x1, #_TIF_WORK_MASK
>   	cbnz	x2, work_pending
> +finish_ret_to_user:
>   	enable_step_tsk x1, x2
>   	kernel_exit 0
>   ENDPROC(ret_to_user)
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index a8eafdbc7cb8..404dd67080b9 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -402,15 +402,31 @@ static void do_signal(struct pt_regs *regs)
>   asmlinkage void do_notify_resume(struct pt_regs *regs,
>   				 unsigned int thread_flags)
>   {
> -	if (thread_flags & _TIF_SIGPENDING)
> -		do_signal(regs);
> -
> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
> -		clear_thread_flag(TIF_NOTIFY_RESUME);
> -		tracehook_notify_resume(regs);
> -	}
> -
> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
> -		fpsimd_restore_current_state();
> +	/*
> +	 * The assembly code enters us with IRQs off, but it hasn't
> +	 * informed the tracing code of that for efficiency reasons.
> +	 * Update the trace code with the current status.
> +	 */
> +	trace_hardirqs_off();
> +	do {
> +		if (thread_flags & _TIF_NEED_RESCHED) {
> +			schedule();
> +		} else {
> +			local_irq_enable();
> +
> +			if (thread_flags & _TIF_SIGPENDING)
> +				do_signal(regs);
> +
> +			if (thread_flags & _TIF_NOTIFY_RESUME) {
> +				clear_thread_flag(TIF_NOTIFY_RESUME);
> +				tracehook_notify_resume(regs);
> +			}
> +
> +			if (thread_flags & _TIF_FOREIGN_FPSTATE)
> +				fpsimd_restore_current_state();
> +		}
>   
> +		local_irq_disable();
> +		thread_flags = READ_ONCE(current_thread_info()->flags);
> +	} while (thread_flags & _TIF_WORK_MASK);
>   }

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-08 15:43   ` [PATCH]: " Chris Metcalf
@ 2016-07-08 15:49     ` Will Deacon
  2016-07-08 16:11       ` Chris Metcalf
  2016-07-11 11:42       ` Will Deacon
  0 siblings, 2 replies; 39+ messages in thread
From: Will Deacon @ 2016-07-08 15:49 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Catalin Marinas, Mark Rutland, linux-arm-kernel, linux-kernel

Hi Chris,

On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
> Ping!
> 
> I am hopeful that this patch [1] can be picked up for the 4.8 merge window
> in the arm64 tree.  As I mentioned in my last patch series that included
> this patch [2], I'm hopeful that this version addresses the performance
> issues that were seen with Mark's original patch.  This version tests the
> TIF flags prior to calling out to the loop in C code.

I still need to get round to measuring that again. This might also
conflict with a late fix I just sent for 4.7.

Is your task isolation series all ready apart from this? It seems like
there's still discussion over there.

Will


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-08 15:49     ` Will Deacon
@ 2016-07-08 16:11       ` Chris Metcalf
  2016-07-11 11:42       ` Will Deacon
  1 sibling, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2016-07-08 16:11 UTC (permalink / raw)
  To: Will Deacon; +Cc: Catalin Marinas, Mark Rutland, linux-arm-kernel, linux-kernel

On 7/8/2016 11:49 AM, Will Deacon wrote:
> Hi Chris,
>
> On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
>> Ping!
>>
>> I am hopeful that this patch [1] can be picked up for the 4.8 merge window
>> in the arm64 tree.  As I mentioned in my last patch series that included
>> this patch [2], I'm hopeful that this version addresses the performance
>> issues that were seen with Mark's original patch.  This version tests the
>> TIF flags prior to calling out to the loop in C code.
> I still need to get round to measuring that again. This might also
> conflict with a late fix I just sent for 4.7.
>
> Is your task isolation series all ready apart from this? It seems like
> there's still discussion over there.

No, discussion of task isolation is still ongoing.  Alas :-)

I'm trying to pick off low-hanging fruit where possible - I think this refactoring
is independently good for arm64 regardless of whether task isolation makes it in for
the 4.8 merge window or not, so figured I'd break it out.

If you get a chance to remeasure, that will be great.  Thanks!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-08 15:49     ` Will Deacon
  2016-07-08 16:11       ` Chris Metcalf
@ 2016-07-11 11:42       ` Will Deacon
  2016-07-11 12:19         ` Mark Rutland
  1 sibling, 1 reply; 39+ messages in thread
From: Will Deacon @ 2016-07-11 11:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Catalin Marinas, Mark Rutland, linux-arm-kernel, linux-kernel

On Fri, Jul 08, 2016 at 04:49:01PM +0100, Will Deacon wrote:
> On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
> > I am hopeful that this patch [1] can be picked up for the 4.8 merge window
> > in the arm64 tree.  As I mentioned in my last patch series that included
> > this patch [2], I'm hopeful that this version addresses the performance
> > issues that were seen with Mark's original patch.  This version tests the
> > TIF flags prior to calling out to the loop in C code.
> 
> I still need to get round to measuring that again. This might also
> conflict with a late fix I just sent for 4.7.

FWIW, I've failed to measure any performance overhead from this change,
so it certainly looks better than Mark's original patch from that angle.

Will


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-11 11:42       ` Will Deacon
@ 2016-07-11 12:19         ` Mark Rutland
  2016-07-29 18:16           ` Chris Metcalf
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Rutland @ 2016-07-11 12:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: Chris Metcalf, Catalin Marinas, linux-arm-kernel, linux-kernel

On Mon, Jul 11, 2016 at 12:42:37PM +0100, Will Deacon wrote:
> On Fri, Jul 08, 2016 at 04:49:01PM +0100, Will Deacon wrote:
> > On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
> > > I am hopeful that this patch [1] can be picked up for the 4.8 merge window
> > > in the arm64 tree.  As I mentioned in my last patch series that included
> > > this patch [2], I'm hopeful that this version addresses the performance
> > > issues that were seen with Mark's original patch.  This version tests the
> > > TIF flags prior to calling out to the loop in C code.
> > 
> > I still need to get round to measuring that again. This might also
> > conflict with a late fix I just sent for 4.7.
> 
> FWIW, I've failed to measure any performance overhead from this change,
> so it certainly looks better than Mark's original patch from that angle.

FWIW, likewise, with perf bench sched atop of v4.7-rc6.

Mark.


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-11 12:19         ` Mark Rutland
@ 2016-07-29 18:16           ` Chris Metcalf
  2016-08-09 10:21             ` Will Deacon
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Metcalf @ 2016-07-29 18:16 UTC (permalink / raw)
  To: Mark Rutland, Will Deacon; +Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

On 7/11/2016 8:19 AM, Mark Rutland wrote:
> On Mon, Jul 11, 2016 at 12:42:37PM +0100, Will Deacon wrote:
>> On Fri, Jul 08, 2016 at 04:49:01PM +0100, Will Deacon wrote:
>>> On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
>>>> I am hopeful that this patch [1] can be picked up for the 4.8 merge window
>>>> in the arm64 tree.  As I mentioned in my last patch series that included
>>>> this patch [2], I'm hopeful that this version addresses the performance
>>>> issues that were seen with Mark's original patch.  This version tests the
>>>> TIF flags prior to calling out to the loop in C code.
>>> I still need to get round to measuring that again. This might also
>>> conflict with a late fix I just sent for 4.7.
>> FWIW, I've failed to measure any performance overhead from this change,
>> so it certainly looks better thsn Mark's original patch from that angle.
> FWIW, likewise, with perf bench sched atop of v4.7-rc6.

I don't see this commit pushed up as part of what's in 4.8 so far. Any chance
it's going to go in later in the merge window?  I'm just hoping to trim down the
task-isolation series by being able to drop this patch from its next iteration... :-)

Thanks!

[1] https://lkml.kernel.org/r/1459877922-15512-13-git-send-email-cmetcalf@mellanox.com

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH]: arm64: factor work_pending state machine to C
  2016-07-29 18:16           ` Chris Metcalf
@ 2016-08-09 10:21             ` Will Deacon
  0 siblings, 0 replies; 39+ messages in thread
From: Will Deacon @ 2016-08-09 10:21 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Mark Rutland, Catalin Marinas, linux-arm-kernel, linux-kernel

On Fri, Jul 29, 2016 at 02:16:07PM -0400, Chris Metcalf wrote:
> On 7/11/2016 8:19 AM, Mark Rutland wrote:
> >On Mon, Jul 11, 2016 at 12:42:37PM +0100, Will Deacon wrote:
> >>On Fri, Jul 08, 2016 at 04:49:01PM +0100, Will Deacon wrote:
> >>>On Fri, Jul 08, 2016 at 11:43:50AM -0400, Chris Metcalf wrote:
> >>>>I am hopeful that this patch [1] can be picked up for the 4.8 merge window
> >>>>in the arm64 tree.  As I mentioned in my last patch series that included
> >>>>this patch [2], I'm hopeful that this version addresses the performance
> >>>>issues that were seen with Mark's original patch.  This version tests the
> >>>>TIF flags prior to calling out to the loop in C code.
> >>>I still need to get round to measuring that again. This might also
> >>>conflict with a late fix I just sent for 4.7.
> >>FWIW, I've failed to measure any performance overhead from this change,
> >>so it certainly looks better than Mark's original patch from that angle.
> >FWIW, likewise, with perf bench sched atop of v4.7-rc6.
> 
> I don't see this commit pushed up as part of what's in 4.8 so far. Any chance
> it's going to go in later in the merge window?  I'm just hoping to trim down the
> task-isolation series by being able to drop this patch from its next iteration... :-)

I'll queue it up for 4.9, once I've had a closer look.

Will


end of thread, other threads:[~2016-08-09 10:21 UTC | newest]

Thread overview: 39+ messages
2016-04-05 17:38 [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 04/13] task_isolation: add initial support Chris Metcalf
2016-05-18 13:34   ` Peter Zijlstra
2016-05-18 16:34     ` Chris Metcalf
2016-05-18 17:09       ` Peter Zijlstra
2016-04-05 17:38 ` [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-05-18 13:35   ` Peter Zijlstra
2016-05-18 16:34     ` Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2016-05-18 13:44   ` Peter Zijlstra
2016-05-18 16:34     ` Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 07/13] task_isolation: add debug boot flag Chris Metcalf
2016-05-18 13:56   ` Peter Zijlstra
2016-05-18 16:36     ` Chris Metcalf
     [not found]     ` <684587d7-3653-7570-215f-37d3e9e786bc@mellanox.com>
2016-05-18 17:06       ` Peter Zijlstra
2016-05-19 15:12         ` Chris Metcalf
     [not found]         ` <b22cf841-9555-fff3-435b-646c4fd95bdc@mellanox.com>
2016-05-19 17:54           ` Peter Zijlstra
2016-05-19 18:05             ` Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 09/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-04-07 16:58   ` Daniel Lezcano
2016-04-05 17:38 ` [PATCH v12 10/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-05-18 16:23   ` Peter Zijlstra
2016-05-18 16:35     ` Chris Metcalf
2016-05-18 17:10       ` Peter Zijlstra
2016-04-05 17:38 ` [PATCH v12 11/13] arch/tile: " Chris Metcalf
2016-04-05 17:38 ` [PATCH v12 12/13] arm64: factor work_pending state machine to C Chris Metcalf
2016-07-08 15:43   ` [PATCH]: " Chris Metcalf
2016-07-08 15:49     ` Will Deacon
2016-07-08 16:11       ` Chris Metcalf
2016-07-11 11:42       ` Will Deacon
2016-07-11 12:19         ` Mark Rutland
2016-07-29 18:16           ` Chris Metcalf
2016-08-09 10:21             ` Will Deacon
2016-04-05 17:38 ` [PATCH v12 13/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-05-12 18:26 ` [PATCH v12 00/13] support "task_isolation" mode Chris Metcalf
