* [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-08 17:58 ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes provide an initial draft of such a framework; they add
no overhead to the usual non-nohz_full mode, and only a very small
overhead to the typical nohz_full mode.  A prctl() option
(PR_SET_DATAPLANE) is added to control whether processes have
requested these stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to make it
easier to debug where unexpected interrupts are coming from.
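
For illustration, here is a minimal sketch of how an application
might opt in.  The prctl constants come from the patches in this
series (they are not yet in mainline headers), and the CPU number
and affinity setup are invented for the example:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <sys/prctl.h>

  #define PR_SET_DATAPLANE     47        /* from this series */
  #define PR_DATAPLANE_ENABLE  (1 << 0)

  int main(void)
  {
      cpu_set_t set;

      /* Pin to a core booted with nohz_full=; cpu 3 is arbitrary. */
      CPU_ZERO(&set);
      CPU_SET(3, &set);
      if (sched_setaffinity(0, sizeof(set), &set) != 0)
          perror("sched_setaffinity");

      /* Request the stricter dataplane semantics for this task. */
      if (prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) != 0)
          perror("prctl(PR_SET_DATAPLANE)");

      /* ... userspace-driver fast path runs here ... */
      return 0;
  }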

Conceptually similar code, known as Zero-Overhead Linux, has been in
use in Tilera's Multicore Development Environment since 2008 and has
seen wide adoption by a range of customers.  This patch series
represents the first serious attempt to upstream that functionality.
Although the current state of the kernel isn't quite ready to run
with absolutely no kernel interrupts (for example, workqueues on
dataplane cores still remain to be dealt with), this patch series
provides a way for tasks that want it to make a dynamic tradeoff:
avoiding kernel interrupts on the one hand, at the cost of making
voluntary transitions in and out of the kernel more expensive on the
other.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (6):
  nohz_full: add support for "dataplane" mode
  nohz: dataplane: allow tick to be fully disabled for dataplane
  dataplane nohz: run softirqs synchronously on user entry
  nohz: support PR_DATAPLANE_QUIESCE
  nohz: support PR_DATAPLANE_STRICT mode
  nohz: add dataplane_debug boot flag

 Documentation/kernel-parameters.txt |   6 ++
 arch/tile/mm/homecache.c            |   5 +-
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |  12 ++++
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |   3 +
 kernel/irq_work.c                   |   4 +-
 kernel/sched/core.c                 |  18 ++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |  15 ++++-
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            | 112 +++++++++++++++++++++++++++++++++++-
 13 files changed, 198 insertions(+), 5 deletions(-)

-- 
2.1.2



* [PATCH 1/6] nohz_full: add support for "dataplane" mode
  2015-05-08 17:58 ` Chris Metcalf
@ 2015-05-08 17:58   ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still avoiding overheads in the kernel entry/exit
path, preserving 100% of the usual kernel semantics, and so forth.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device-driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The dataplane state is indicated by setting a new task struct
field, dataplane_flags, to the value passed by prctl().  When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_dataplane_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

For this first patch, the only action taken is to call
lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.
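
As a usage sketch (assuming the constants defined in this patch;
run_fast_path() is a hypothetical application routine), a task can
toggle the mode around its critical section and read the flags back:

  prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE);  /* enter dataplane mode */
  run_fast_path();                               /* hypothetical fast path */
  prctl(PR_SET_DATAPLANE, 0);                    /* restore normal semantics */

  /* PR_GET_DATAPLANE reports the current flags in its return value. */
  int flags = prctl(PR_GET_DATAPLANE, 0, 0, 0, 0);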

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++++
 include/uapi/linux/prctl.h |  5 +++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 ++++++++
 kernel/time/tick-sched.c   | 13 +++++++++++++
 6 files changed, 42 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..3680aa07c9ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int	dataplane_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..d191cda9b71a 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
 	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_dataplane(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->dataplane_flags & PR_DATAPLANE_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_dataplane_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_dataplane(void) { return false; }
+static inline void tick_nohz_dataplane_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1aa8fa8a8b05 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query dataplane mode for NO_HZ_FULL kernels. */
+#define PR_SET_DATAPLANE	47
+#define PR_GET_DATAPLANE	48
+# define PR_DATAPLANE_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..dd6bdd6197b6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (tick_nohz_is_dataplane())
+					tick_nohz_dataplane_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..930b750aefde 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_DATAPLANE:
+		me->dataplane_flags = arg2;
+		break;
+	case PR_GET_DATAPLANE:
+		error = me->dataplane_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..31c674719647 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>
 
 #include <asm/irq_regs.h>
 
@@ -389,6 +390,18 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * When returning to userspace on a nohz_full core after doing
+ * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
+ * more aggressively to prevent this core from being interrupted later.
+ */
+void tick_nohz_dataplane_enter(void)
+{
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+}
+
 #endif
 
 /*
-- 
2.1.2



* [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  2015-05-08 17:58 ` Chris Metcalf
@ 2015-05-08 17:58 ` Chris Metcalf
  2015-05-12  9:26   ` Peter Zijlstra
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-kernel
  Cc: Chris Metcalf

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes that have
requested prctl(PR_SET_DATAPLANE) semantics place a higher priority
on running completely tickless, so don't bound the time_delta for
such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted.  Frederic Weisbecker
similarly observed that allowing the tick to be deferred indefinitely
just meant that no one would ever fix the underlying bugs.  However,
it's at least true that the mode proposed in this patch can only be
enabled on an isolcpus core, which may limit how important it is to
maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code in its Zero-Overhead Linux for many years (starting in
2005), and customers are very enthusiastic about the resulting
bare-metal performance on cores that remain available to run full
Linux semantics on demand (crash, logging, shutdown, etc.).  So these
semantics are very useful if we can convince ourselves that providing
them is safe.
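
For concreteness, the expected configuration dedicates the same cores
to isolcpus and nohz_full, along these lines (core numbers invented
for the example):

  isolcpus=1-7 nohz_full=1-7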

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 31c674719647..25fdd6bdd1eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -644,7 +644,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		}
 
 #ifdef CONFIG_NO_HZ_FULL
-		if (!ts->inidle) {
+		if (!ts->inidle && !tick_nohz_is_dataplane()) {
 			time_delta = min(time_delta,
 					 scheduler_tick_max_deferment());
 		}
-- 
2.1.2



* [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-08 17:58 ` Chris Metcalf
                   ` (2 preceding siblings ...)
@ 2015-05-08 17:58 ` Chris Metcalf
  2015-05-09  7:04   ` Mike Galbraith
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat,
	linux-kernel
  Cc: Chris Metcalf

For tasks that have elected dataplane functionality, we run any
pending softirqs for the core before returning to userspace, rather
than ever scheduling ksoftirqd to run.  The problem this fixes is
that waking ksoftirqd allows another task to run on the core, which
guarantees more interrupts for the dataplane task in the future,
exactly what dataplane mode is required to prevent.

This may be an alternate approach to what Mike Galbraith
recently proposed in e.g.:

  https://lkml.org/lkml/2015/3/13/11

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/softirq.c         | 14 +++++++++++++-
 kernel/time/tick-sched.c | 26 +++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..bc9406337f82 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -291,6 +291,15 @@ restart:
 		    --max_restart)
 			goto restart;
 
+		/*
+		 * For dataplane tasks, waking ksoftirqd because the
+		 * softirqs are slow is a bad idea; we would rather
+		 * synchronously finish whatever is interrupting us,
+		 * and then be able to cleanly enter dataplane mode.
+		 */
+		if (tick_nohz_is_dataplane())
+			goto restart;
+
 		wakeup_softirqd();
 	}
 
@@ -410,8 +419,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
 	 *
 	 * Otherwise we wake up ksoftirqd to make sure we
 	 * schedule the softirq soon.
+	 *
+	 * For dataplane tasks, we will handle the softirq
+	 * synchronously on return to userspace.
 	 */
-	if (!in_interrupt())
+	if (!in_interrupt() && !tick_nohz_is_dataplane())
 		wakeup_softirqd();
 }
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 25fdd6bdd1eb..fd0e6e5c931c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,26 @@ void __init tick_nohz_init(void)
  */
 void tick_nohz_dataplane_enter(void)
 {
+	/*
+	 * Check for softirqs as close as possible to our return to
+	 * userspace, and run any that are waiting.  We need to ensure
+	 * that we can safely avoid running softirqd, which will cause
+	 * interrupts for nohz_full tasks.  Note that interrupts may
+	 * be enabled internally by do_softirq().
+	 */
+	do_softirq();
+
 	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
 	lru_add_drain();
+
+	/*
+	 * Disable interrupts again since other code running in this
+	 * function may have enabled them, and the caller expects
+	 * interrupts to be disabled on return.  Enabling them during
+	 * this call is safe since the caller is not assuming any
+	 * state that might have been altered by an interrupt.
+	 */
+	local_irq_disable();
 }
 
 #endif
@@ -771,7 +789,13 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 	if (need_resched())
 		return false;
 
-	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+	/*
+	 * If we are running dataplane for this process, don't worry
+	 * about pending softirqs; we will force them to run
+	 * synchronously before returning to userspace.
+	 */
+	if (unlikely(local_softirq_pending() && cpu_online(cpu) &&
+		     !tick_nohz_is_dataplane())) {
 		static int ratelimit;
 
 		if (ratelimit < 10 &&
-- 
2.1.2



* [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-08 17:58 ` Chris Metcalf
@ 2015-05-08 17:58   ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
kernel to quiesce any pending timer interrupts prior to returning to
userspace.  When running with this mode set, system calls (and page
faults, etc.) can be inordinately slow.  However, user applications
that want to guarantee that no unexpected interrupts will occur (even
if they call into the kernel) can set this flag to obtain those
semantics.
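
As a sketch (constants as defined in this series), a task willing to
pay for slow kernel entries in exchange for guaranteed-quiet returns
to userspace would set both bits:

  /* Every subsequent return to userspace now waits in the kernel
   * until no timer event remains pending on this core. */
  prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE);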

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  1 +
 kernel/time/tick-sched.c   | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 1aa8fa8a8b05..8b735651304a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_DATAPLANE	47
 #define PR_GET_DATAPLANE	48
 # define PR_DATAPLANE_ENABLE	(1 << 0)
+# define PR_DATAPLANE_QUIESCE	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fd0e6e5c931c..69d908c6cef8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -392,6 +392,53 @@ void __init tick_nohz_init(void)
 }
 
 /*
+ * We normally return immediately to userspace.
+ *
+ * The PR_DATAPLANE_QUIESCE flag causes us to wait until no more
+ * interrupts are pending.  Otherwise we nap with interrupts enabled
+ * and wait for the next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two processes on the same core and both
+ * specify PR_DATAPLANE_QUIESCE, neither will ever leave the kernel,
+ * and one will have to be killed manually.  Otherwise in situations
+ * where another process is in the runqueue on this cpu, this task
+ * will just wait for that other task to go idle before returning to
+ * user space.
+ */
+static void dataplane_quiesce(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: dataplane task blocked for %ld jiffies\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start));
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+
+		/* Idle with interrupts enabled and wait for the tick. */
+		set_current_state(TASK_INTERRUPTIBLE);
+		arch_cpu_idle();
+		set_current_state(TASK_RUNNING);
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: dataplane task unblocked after %ld jiffies\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start));
+		dump_stack();
+	}
+}
+
+/*
  * When returning to userspace on a nohz_full core after doing
 * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
 * more aggressively to prevent this core from being interrupted later.
@@ -411,6 +458,13 @@ void tick_nohz_dataplane_enter(void)
 	lru_add_drain();
 
 	/*
+	 * Quiesce any timer ticks if requested.  On return from this
+	 * function, no timer ticks are pending.
+	 */
+	if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
+		dataplane_quiesce();
+
+	/*
 	 * Disable interrupts again since other code running in this
 	 * function may have enabled them, and the caller expects
 	 * interrupts to be disabled on return.  Enabling them during
-- 
2.1.2



* [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-08 17:58 ` Chris Metcalf
@ 2015-05-08 17:58   ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With QUIESCE mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of
a number of other synchronous traps, it may be unexpectedly
exposed to long latencies.  Add a simple flag that puts the process
into a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit
to current->dataplane_flags that is set when prctl() sets the flags.
That way, when we are exiting the kernel after calling prctl() to
forbid future kernel entries, we don't get immediately killed.
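
As a sketch of the intended usage (names as defined in this patch;
setup_buffers() is a hypothetical application routine), STRICT would
typically be set last, once all expected kernel work is done; the
internal PR_DATAPLANE_PRCTL bit is what lets this final prctl() call
itself return to userspace without being killed:

  /* Setup phase: system calls are still permitted here. */
  setup_buffers();  /* hypothetical application setup */

  /* From the next kernel entry onward, any system call, page
   * fault, or other synchronous trap kills the process group. */
  prctl(PR_SET_DATAPLANE,
        PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE | PR_DATAPLANE_STRICT);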

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/sys.c               |  2 +-
 kernel/time/tick-sched.c   | 17 +++++++++++++++++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 8b735651304a..9cf79aa1e73f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_DATAPLANE	48
 # define PR_DATAPLANE_ENABLE	(1 << 0)
 # define PR_DATAPLANE_QUIESCE	(1 << 1)
+# define PR_DATAPLANE_STRICT	(1 << 2)
+# define PR_DATAPLANE_PRCTL	(1U << 31)	/* kernel internal */
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 930b750aefde..8102433c9edd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2245,7 +2245,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		break;
 #ifdef CONFIG_NO_HZ_FULL
 	case PR_SET_DATAPLANE:
-		me->dataplane_flags = arg2;
+		me->dataplane_flags = arg2 | PR_DATAPLANE_PRCTL;
 		break;
 	case PR_GET_DATAPLANE:
 		error = me->dataplane_flags;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69d908c6cef8..22ed0decb363 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
 			(jiffies - start));
 		dump_stack();
 	}
+
+	/*
+	 * Kill the process if it violates STRICT mode.  Note that this
+	 * code also results in killing the task if a kernel bug causes an
+	 * irq to be delivered to this core.
+	 */
+	if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
+	    == PR_DATAPLANE_STRICT) {
+		pr_warn("Dataplane STRICT mode violated; process killed.\n");
+		dump_stack();
+		task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
+		local_irq_enable();
+		do_group_exit(SIGKILL);
+	}
 }
 
 /*
@@ -464,6 +478,9 @@ void tick_nohz_dataplane_enter(void)
 	if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
 		dataplane_quiesce();
 
+	/* Clear the bit set by prctl() when it updates the flags. */
+	current->dataplane_flags &= ~PR_DATAPLANE_PRCTL;
+
 	/*
 	 * Disable interrupts again since other code running in this
 	 * function may have enabled them, and the caller expects
-- 
2.1.2



* [PATCH 6/6] nohz: add dataplane_debug boot flag
  2015-05-08 17:58 ` Chris Metcalf
                   ` (5 preceding siblings ...)
@ 2015-05-08 17:58 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-kernel
  Cc: Chris Metcalf

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_DATAPLANE_QUIESCE mode.  Such processes should get
no interrupts from the kernel; when this boot flag is specified and
they nevertheless do, a kernel stack dump is generated on the
console.

It's possible to use ftrace to simply detect whether a dataplane core
has unexpectedly entered the kernel.  But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a dataplane core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
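
For example, a debugging session on a hypothetical 8-core machine
might boot with:

  isolcpus=2-7 nohz_full=2-7 dataplane_debug

after which any interrupt aimed at a core running a task with
PR_DATAPLANE_QUIESCE set produces an "Interrupt detected for
dataplane cpu N" message and a backtrace on the console.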

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/tick.h                |  2 ++
 kernel/irq_work.c                   |  4 +++-
 kernel/sched/core.c                 | 18 ++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  1 +
 8 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..5c5af5258e17 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -794,6 +794,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	dasd=		[HW,NET]
 			See header of drivers/s390/block/dasd_devmap.c.
 
+	dataplane_debug	[KNL]
+			In kernels built with CONFIG_NO_HZ_FULL and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_DATAPLANE_QUIESCE.
+
 	db9.dev[2|3]=	[HW,JOY] Multisystem joystick support via parallel port
 			(one device per port)
 			Format: <port#>,<type>
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..dd5ec7eca9a8 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/tick.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		tick_nohz_dataplane_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index d191cda9b71a..4610cdf0f972 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,7 @@ extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_dataplane_enter(void);
+extern void tick_nohz_dataplane_debug(int cpu);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +158,7 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 static inline bool tick_nohz_is_dataplane(void) { return false; }
 static inline void tick_nohz_dataplane_enter(void) { }
+static inline void tick_nohz_dataplane_debug(int cpu) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..0adc53c4e899 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		tick_nohz_dataplane_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..202fab0c41cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)
 
 	return true;
 }
+
+/* Enable debugging of any interrupts of dataplane cores. */
+static int dataplane_debug;
+static int __init dataplane_debug_func(char *str)
+{
+	dataplane_debug = true;
+	return 1;
+}
+__setup("dataplane_debug", dataplane_debug_func);
+
+void tick_nohz_dataplane_debug(int cpu)
+{
+	if (dataplane_debug && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->dataplane_flags & PR_DATAPLANE_QUIESCE)) {
+		pr_err("Interrupt detected for dataplane cpu %d\n", cpu);
+		dump_stack();
+	}
+}
 #endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..ebc552cafff5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_NO_HZ_FULL
+	/* If the task is being killed, don't complain about dataplane. */
+	if (state & TASK_WAKEKILL)
+		t->dataplane_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..9518fc80321b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/tick.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	tick_nohz_dataplane_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		tick_nohz_dataplane_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bc9406337f82..eeacabf08ca6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -394,6 +394,7 @@ void irq_exit(void)
 	WARN_ON_ONCE(!irqs_disabled());
 #endif
 
+	tick_nohz_dataplane_debug(smp_processor_id());
 	account_irq_exit_time(current);
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
-- 
2.1.2



* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-08 21:18   ` Andrew Morton
  0 siblings, 0 replies; 340+ messages in thread
From: Andrew Morton @ 2015-05-08 21:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:

> A prctl() option (PR_SET_DATAPLANE) is added

Dumb question: what does the term "dataplane" mean in this context?  I
can't see the relationship between those words and what this patch
does.



* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-08 21:18   ` Andrew Morton
@ 2015-05-08 21:22     ` Steven Rostedt
  -1 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-05-08 21:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Fri, 8 May 2015 14:18:24 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> 
> > A prctl() option (PR_SET_DATAPLANE) is added
> 
> Dumb question: what does the term "dataplane" mean in this context?  I
> can't see the relationship between those words and what this patch
> does.

I was thinking the same thing. I haven't gotten around to searching
DATAPLANE yet.

I would assume we want a name that is more meaningful for what is
happening.

-- Steve


* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-08 23:11       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 23:11 UTC (permalink / raw)
  To: Steven Rostedt, Andrew Morton
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> On Fri, 8 May 2015 14:18:24 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>
>>> A prctl() option (PR_SET_DATAPLANE) is added
>> Dumb question: what does the term "dataplane" mean in this context?  I
>> can't see the relationship between those words and what this patch
>> does.
> I was thinking the same thing. I haven't gotten around to searching
> DATAPLANE yet.
>
> I would assume we want a name that is more meaningful for what is
> happening.

The text in the commit message and the 0/6 cover letter do try to explain
the concept.  The terminology comes, I think, from networking line cards,
where the "dataplane" is the part of the application that handles all the
fast path processing of network packets, and the "control plane" is the part
that handles routing updates, etc., generally slow-path stuff.  I've probably
just been using the terms so long they seem normal to me.

That said, what would be clearer?  NO_HZ_STRICT as a superset of
NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after all,
we're talking about no interrupts of any kind, and maybe NO_HZ is too
limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
to vendors who ship bare-metal runtimes and call it BARE_METAL?
Borrow the Tilera marketing name and call it ZERO_OVERHEAD?

Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
of course :-)

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com



* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-08 23:11       ` Chris Metcalf
@ 2015-05-08 23:19         ` Andrew Morton
  -1 siblings, 0 replies; 340+ messages in thread
From: Andrew Morton @ 2015-05-08 23:19 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:

> On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > On Fri, 8 May 2015 14:18:24 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >>
> >>> A prctl() option (PR_SET_DATAPLANE) is added
> >> Dumb question: what does the term "dataplane" mean in this context?  I
> >> can't see the relationship between those words and what this patch
> >> does.
> > I was thinking the same thing. I haven't gotten around to searching
> > DATAPLANE yet.
> >
> > I would assume we want a name that is more meaningful for what is
> > happening.
> 
> The text in the commit message and the 0/6 cover letter do try to explain
> the concept.  The terminology comes, I think, from networking line cards,
> where the "dataplane" is the part of the application that handles all the
> fast path processing of network packets, and the "control plane" is the part
> that handles routing updates, etc., generally slow-path stuff.  I've probably
> just been using the terms so long they seem normal to me.
> 
> That said, what would be clearer?  NO_HZ_STRICT as a superset of
> NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after all,
> we're talking about no interrupts of any kind, and maybe NO_HZ is too
> limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
> to vendors who ship bare-metal runtimes and call it BARE_METAL?
> Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> 
> Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> of course :-)

I like NO_INTERRUPTS.  Simple, direct.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-08 17:58 ` [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry Chris Metcalf
@ 2015-05-09  7:04   ` Mike Galbraith
  2015-05-11 20:13     ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Mike Galbraith @ 2015-05-09  7:04 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat,
	linux-kernel

On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> For tasks which have elected dataplane functionality, we run
> any pending softirqs for the core before returning to userspace,
> rather than ever scheduling ksoftirqd to run.  The problem we
> fix is that by allowing another task to run on the core, we
> guarantee more interrupts in the future to the dataplane task,
> which is exactly what dataplane mode is required to prevent.

If ksoftirqd were rt class, softirqs would be gone when the soloist gets
the CPU back and heads to userspace.  Being a soloist, it has no use for
a priority, so why can't it just let ksoftirqd run if it raises the
occasional softirq?  Meeting a contended lock while processing it will
wreck the soloist regardless of who does that processing.
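
Just to make the experiment concrete, the soloist's setup code could do
something like this (untested sketch; needs CAP_SYS_NICE, and finding the
pid of "ksoftirqd/<cpu>" under /proc is left out):

#include <sched.h>
#include <sys/types.h>

/*
 * Untested sketch: give ksoftirqd/<cpu> a SCHED_FIFO priority so that
 * any softirq work it picks up is finished before the soloist gets the
 * CPU back and heads to userspace.  Equivalent to "chrt -f -p".
 */
static int make_ksoftirqd_rt(pid_t ksoftirqd_pid, int prio)
{
        struct sched_param sp = { .sched_priority = prio };

        return sched_setscheduler(ksoftirqd_pid, SCHED_FIFO, &sp);
}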

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-08 23:19         ` Andrew Morton
  (?)
@ 2015-05-09  7:05         ` Ingo Molnar
  2015-05-09  7:19             ` Andy Lutomirski
                             ` (2 more replies)
  -1 siblings, 3 replies; 340+ messages in thread
From: Ingo Molnar @ 2015-05-09  7:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar,
	Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> 
> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > On Fri, 8 May 2015 14:18:24 -0700
> > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > >>
> > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > >> Dumb question: what does the term "dataplane" mean in this context?  I
> > >> can't see the relationship between those words and what this patch
> > >> does.
> > > I was thinking the same thing. I haven't gotten around to searching
> > > DATAPLANE yet.
> > >
> > > I would assume we want a name that is more meaningful for what is
> > > happening.
> > 
> > The text in the commit message and the 0/6 cover letter do try to explain
> > the concept.  The terminology comes, I think, from networking line cards,
> > where the "dataplane" is the part of the application that handles all the
> > fast path processing of network packets, and the "control plane" is the part
> > that handles routing updates, etc., generally slow-path stuff.  I've probably
> > just been using the terms so long they seem normal to me.
> > 
> > That said, what would be clearer?  NO_HZ_STRICT as a superset of
> > NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after all,
> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > 
> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > of course :-)

'baremetal' has uses in virtualization speak, so I think that would be 
confusing.

> I like NO_INTERRUPTS.  Simple, direct.

NO_HZ_PURE?

That's what it's really about: user-space wants to run exclusively, in 
pure user-mode, without any interrupts.

So I don't like 'NO_HZ_NO_INTERRUPTS' for a couple of reasons:

 - It is similar to a term we use in perf: PERF_PMU_CAP_NO_INTERRUPT.

 - Another reason is that 'NO_INTERRUPTS', in most existing uses in 
   the kernel generally relates to some sort of hardware weakness, 
   limitation, a negative property: that we try to limp along without 
   having a hardware interrupt and have to poll. In other driver code
   that uses variants of NO_INTERRUPT it appears to be similar. So I 
   think there's some confusion potential here.

 - Here the fact that we don't disturb user-space is an absolutely
   positive property, not a limitation, a kernel feature we work hard 
   to achieve. NO_HZ_PURE would convey that while NO_HZ_NO_INTERRUPTS 
   wouldn't.

 - NO_HZ_NO_INTERRUPTS has a double negation, and it's also too long,
   compared to NO_HZ_FULL or NO_HZ_PURE ;-) The term 'no HZ' already 
   expresses that we don't have periodic interruptions. We just 
   duplicate that information with NO_HZ_NO_INTERRUPTS, while 
   NO_HZ_FULL or NO_HZ_PURE qualifies it, makes it a stronger
   property - which is what we want I think.

So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep 
it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be 
such a 'zero overhead' mode of operation, where if user-space runs, it 
won't get interrupted in any way.

There's no need to add yet another Kconfig variant - lets just enhance 
the current stuff and maybe rename it to NO_HZ_PURE to better express 
its intent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-09  7:19             ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-09  7:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API,
	linux-kernel

On Sat, May 9, 2015 at 12:05 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>
>> > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
>> > > On Fri, 8 May 2015 14:18:24 -0700
>> > > Andrew Morton <akpm@linux-foundation.org> wrote:
>> > >
>> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> > >>
>> > >>> A prctl() option (PR_SET_DATAPLANE) is added
>> > >> Dumb question: what does the term "dataplane" mean in this context?  I
>> > >> can't see the relationship between those words and what this patch
>> > >> does.
>> > > I was thinking the same thing. I haven't gotten around to searching
>> > > DATAPLANE yet.
>> > >
>> > > I would assume we want a name that is more meaningful for what is
>> > > happening.
>> >
>> > The text in the commit message and the 0/6 cover letter do try to explain
>> > the concept.  The terminology comes, I think, from networking line cards,
>> > where the "dataplane" is the part of the application that handles all the
>> > fast path processing of network packets, and the "control plane" is the part
>> > that handles routing updates, etc., generally slow-path stuff.  I've probably
>> > just been using the terms so long they seem normal to me.
>> >
>> > That said, what would be clearer?  NO_HZ_STRICT as a superset of
>> > NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after all,
>> > we're talking about no interrupts of any kind, and maybe NO_HZ is too
>> > limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
>> > to vendors who ship bare-metal runtimes and call it BARE_METAL?
>> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
>> >
>> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
>> > of course :-)
>
> 'baremetal' has uses in virtualization speak, so I think that would be
> confusing.
>
>> I like NO_INTERRUPTS.  Simple, direct.
>
> NO_HZ_PURE?
>

Naming aside, I don't think this should be a per-task flag at all.  We
already have way too much overhead per syscall in nohz mode, and it
would be nice to get the per-syscall overhead as low as possible.  We
should strive, for all tasks, to keep syscall overhead down *and*
avoid as many interrupts as possible.

That being said, I do see a legitimate use for a way to tell the
kernel "I'm going to run in userspace for a long time; stay away".
But shouldn't that be a single operation, not an ongoing flag?  IOW, I
think that we should have a new syscall quiesce() or something rather
than a prctl.
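
To make the contrast concrete, here's a purely hypothetical sketch --
sys_quiesce() does not exist, process_packets() is a stand-in, and the
PR_* names come from this unmerged series, so none of this compiles
against a stock tree:

#include <sys/prctl.h>

extern void process_packets(void);      /* stand-in for the app's fast path */
extern int quiesce(void);               /* hypothetical one-shot syscall */

void loop_with_flag(void)
{
        /* Ongoing per-task flag, as in this series: every subsequent
         * kernel exit pays to re-check and re-quiesce. */
        prctl(PR_SET_DATAPLANE, PR_DATAPLANE_QUIESCE, 0, 0, 0);
        for (;;)
                process_packets();
}

void loop_with_oneshot(void)
{
        for (;;) {
                /* One-shot: returns once the CPU is quiet, leaving no
                 * lingering per-syscall overhead behind. */
                quiesce();
                process_packets();
        }
}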

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-09  7:19             ` Mike Galbraith
  0 siblings, 0 replies; 340+ messages in thread
From: Mike Galbraith @ 2015-05-09  7:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api,
	linux-kernel

On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > 
> > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > > >>
> > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > >> Dumb question: what does the term "dataplane" mean in this context?  I
> > > >> can't see the relationship between those words and what this patch
> > > >> does.
> > > > I was thinking the same thing. I haven't gotten around to searching
> > > > DATAPLANE yet.
> > > >
> > > > I would assume we want a name that is more meaningful for what is
> > > > happening.
> > > 
> > > The text in the commit message and the 0/6 cover letter do try to explain
> > > the concept.  The terminology comes, I think, from networking line cards,
> > > where the "dataplane" is the part of the application that handles all the
> > > fast path processing of network packets, and the "control plane" is the part
> > > that handles routing updates, etc., generally slow-path stuff.  I've probably
> > > just been using the terms so long they seem normal to me.
> > > 
> > > That said, what would be clearer?  NO_HZ_STRICT as a superset of
> > > NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after all,
> > > we're talking about no interrupts of any kind, and maybe NO_HZ is too
> > > limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
> > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > > 
> > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > of course :-)
> 
> 'baremetal' has uses in virtualization speak, so I think that would be 
> confusing.
> 
> > I like NO_INTERRUPTS.  Simple, direct.
> 
> NO_HZ_PURE?

Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-08 17:58   ` Chris Metcalf
  (?)
@ 2015-05-09  7:28   ` Andy Lutomirski
  2015-05-09 10:37       ` Gilad Ben Yossef
  2015-05-11 19:13       ` Chris Metcalf
  -1 siblings, 2 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-09  7:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker,
	Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton,
	linux-kernel, Thomas Gleixner, Tejun Heo, Peter Zijlstra,
	Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API

On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>
> With QUIESCE mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves.  In particular,
> if it enters the kernel via system call, page fault, or any of
> a number of other synchronous traps, it may be unexpectedly
> exposed to long latencies.  Add a simple flag that puts the process
> into a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we add an internal
> bit to current->dataplane_flags that is set when prctl() sets the
> flags.  That way, when we are exiting the kernel after calling
> prctl() to forbid future kernel exits, we don't get immediately
> killed.

Is there any reason this can't already be addressed in userspace using
/proc/interrupts or perf_events?  ISTM the real goal here is to detect
when we screw up and fail to avoid an interrupt, and killing the task
seems like overkill to me.
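
Something along those lines is doable today; a rough sketch (naive
column parsing, minimal error handling):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the /proc/interrupts column for one CPU.  Snapshot it before and
 * after the critical loop; any delta means an interrupt snuck in.
 * Rows with fewer columns (ERR:, MIS:) simply don't contribute for
 * higher CPU columns, which is fine for a coarse check.
 */
static unsigned long long irq_count(int cpu_col)
{
        char line[4096];
        unsigned long long total = 0;
        FILE *f = fopen("/proc/interrupts", "r");

        if (!f)
                return 0;
        fgets(line, sizeof(line), f);           /* skip "CPU0 CPU1 ..." */
        while (fgets(line, sizeof(line), f)) {
                char *tok = strtok(line, " \t");        /* "42:", "NMI:", ... */
                int col;

                for (col = 0; tok && col <= cpu_col; col++)
                        tok = strtok(NULL, " \t");
                if (tok)
                        total += strtoull(tok, NULL, 10);
        }
        fclose(f);
        return total;
}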

Also, can we please stop further torturing the exit paths?  We have a
disaster of assembly code that calls into syscall_trace_leave and
do_notify_resume.  Those functions, in turn, *both* call user_enter
(WTF?), and on very brief inspection user_enter makes it into the nohz
code through multiple levels of indirection, which, with these
patches, has yet another conditionally enabled helper, which does this
new stuff.  It's getting to be impossible to tell what happens when we
exit to user space any more.

Also, I think your code is buggy.  There's no particular guarantee
that user_enter is only called once between sys_prctl and the final
exit to user mode (see the above WTF), so you might spuriously kill
the process.

Also, I think that most users will be quite surprised if "strict
dataplane" mode causes any machine check on the system to kill their
dataplane task.  Similarly, a user accidentally running perf record -a
should probably get some reasonable semantics.  /proc/interrupts gets
that right as is.  Sure, MCEs will hurt your RT performance, but Intel
screwed up the way that MCEs work, so we should make do.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* RE: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-09 10:18               ` Gilad Ben Yossef
  0 siblings, 0 replies; 340+ messages in thread
From: Gilad Ben Yossef @ 2015-05-09 10:18 UTC (permalink / raw)
  To: Mike Galbraith, Ingo Molnar
  Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

> From: Mike Galbraith [mailto:umgwanakikbuti@gmail.com]
> Sent: Saturday, May 09, 2015 10:20 AM
> To: Ingo Molnar
> Cc: Andrew Morton; Chris Metcalf; Steven Rostedt; Gilad Ben Yossef; Ingo
> Molnar; Peter Zijlstra; Rik van Riel; Tejun Heo; Frederic Weisbecker;
> Thomas Gleixner; Paul E. McKenney; Christoph Lameter; Srivatsa S. Bhat;
> linux-doc@vger.kernel.org; linux-api@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full
> 
> On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote:
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com>
> wrote:
> > >
> > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote:
> > > > > On Fri, 8 May 2015 14:18:24 -0700
> > > > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf
> <cmetcalf@ezchip.com> wrote:
> > > > >>
> > > > >>> A prctl() option (PR_SET_DATAPLANE) is added
> > > > >> Dumb question: what does the term "dataplane" mean in this
> context?  I
> > > > >> can't see the relationship between those words and what this
> patch
> > > > >> does.
> > > > > I was thinking the same thing. I haven't gotten around to
> searching
> > > > > DATAPLANE yet.
> > > > >
> > > > > I would assume we want a name that is more meaningful for what is
> > > > > happening.
> > > >
> > > > The text in the commit message and the 0/6 cover letter do try to
> explain
> > > > the concept.  The terminology comes, I think, from networking line
> cards,
> > > > where the "dataplane" is the part of the application that handles
> all the
> > > > fast path processing of network packets, and the "control plane" is
> the part
> > > > that handles routing updates, etc., generally slow-path stuff.  I've
> probably
> > > > just been using the terms so long they seem normal to me.
> > > >
> > > > That said, what would be clearer?  NO_HZ_STRICT as a superset of
> > > > NO_HZ_FULL?  Or move away from the NO_HZ terminology a bit; after
> all,
> > > > we're talking about no interrupts of any kind, and maybe NO_HZ is
> too
> > > > limited in scope?  So, NO_INTERRUPTS?  USERSPACE_ONLY?  Or look
> > > > to vendors who ship bare-metal runtimes and call it BARE_METAL?
> > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD?
> > > >
> > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me,
> > > > of course :-)
> >
> > 'baremetal' has uses in virtualization speak, so I think that would be
> > confusing.
> >
> > > I like NO_INTERRUPTS.  Simple, direct.
> >
> > NO_HZ_PURE?
> 
> Hm, coke light, coke zero... OS_LIGHT and OS_ZERO?
LOL... you forgot OS_CLASSIC for backwards compatibility :-)

How about TASK_SOLO?  Yes, you are trying to achieve the least amount
of interference, but the bigger context is about monopolizing a single
CPU for yourself.

Anyway, it is worth pointing out that while NO_HZ_FULL is very useful
in conjunction with this, turning the tick off is also useful when you
have multiple tasks runnable (e.g. if you know you only need to context
switch in 100 ms, why keep a periodic interrupt running?), even though
we don't support that *right now*.  It might be a good idea not to
entangle these concepts too much.

Gilad
Gilad Ben-Yossef
Chief Software Architect
EZchip Technologies Ltd.
37 Israel Pollak Ave, Kiryat Gat 82025, Israel
Tel: +972-4-959-6666 ext. 576, Fax: +972-8-681-1483 
Mobile: +972-52-826-0388, US Mobile: +1-973-826-0388
Email: giladb@ezchip.com, Web: http://www.ezchip.com

^ permalink raw reply	[flat|nested] 340+ messages in thread

* RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-09  7:28   ` Andy Lutomirski
@ 2015-05-09 10:37       ` Gilad Ben Yossef
  2015-05-11 19:13       ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Gilad Ben Yossef @ 2015-05-09 10:37 UTC (permalink / raw)
  To: Andy Lutomirski, Chris Metcalf
  Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker,
	Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton,
	linux-kernel, Thomas Gleixner, Tejun Heo, Peter Zijlstra,
	Steven Rostedt, Christoph Lameter, Linux API

> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Saturday, May 09, 2015 10:29 AM
> To: Chris Metcalf
> Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar;
> Rik van Riel; linux-doc@vger.kernel.org; Andrew Morton; linux-
> kernel@vger.kernel.org; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven
> Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API
> Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
> 
> On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
> >
> > With QUIESCE mode, the task is in principle guaranteed not to be
> > interrupted by the kernel, but only if it behaves.  In particular,
> > if it enters the kernel via system call, page fault, or any of
> > a number of other synchronous traps, it may be unexpectedly
> > exposed to long latencies.  Add a simple flag that puts the process
> > into a state where any such kernel entry is fatal.
> >
> > To allow the state to be entered and exited, we add an internal
> > bit to current->dataplane_flags that is set when prctl() sets the
> > flags.  That way, when we are exiting the kernel after calling
> > prctl() to forbid future kernel exits, we don't get immediately
> > killed.
> 
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.
> 
> Also, can we please stop further torturing the exit paths?  
So, I don't know if it is a practical suggestion or not, but would it
be better/easier to mark a pending signal on kernel entry for this case?

The upsides I see are that the user gets her notification (killing the
task or just logging the event in a signal handler) and, hopefully,
since returning to userspace with a pending signal is already handled,
we don't need new code in the exit path.
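
The userspace side might then look like this (illustrative only -- today
the series kills the task outright, and the choice of SIGXCPU here is
completely made up):

#include <signal.h>

/*
 * Sketch: if the kernel posted a catchable signal on a forbidden kernel
 * entry instead of SIGKILL, the app could count the violation here, or
 * re-raise and die to keep the strict behaviour.
 */
static volatile sig_atomic_t dataplane_violations;

static void violation_handler(int sig)
{
        dataplane_violations++;         /* async-signal-safe */
}

static void install_violation_handler(void)
{
        struct sigaction sa = { .sa_handler = violation_handler };

        sigemptyset(&sa.sa_mask);
        sigaction(SIGXCPU, &sa, NULL);
}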

Gilad

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-11 12:57             ` Steven Rostedt
  0 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-05-11 12:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar,
	Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel


NO_HZ_LEAVE_ME_THE_FSCK_ALONE!


On Sat, 9 May 2015 09:05:38 +0200
Ingo Molnar <mingo@kernel.org> wrote:
 
> So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep 
> it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be 
> such a 'zero overhead' mode of operation, where if user-space runs, it 
> won't get interrupted in any way.


All kidding aside, I think this is the real answer. We don't need a new
NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
what it was created to do. That should be fixed.

Please let's get NO_HZ_FULL up to par. That should be the main focus.

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 12:57             ` Steven Rostedt
  (?)
@ 2015-05-11 15:36             ` Frederic Weisbecker
  2015-05-11 19:19               ` Mike Galbraith
  -1 siblings, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-05-11 15:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> 
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> 
> 
> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
>  
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep 
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be 
> > such a 'zero overhead' mode of operation, where if user-space runs, it 
> > won't get interrupted in any way.
> 
> 
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
> 
> Please lets get NO_HZ_FULL up to par. That should be the main focus.

Now if we can manage to make NO_HZ_FULL behave in a specific way
that fits everyone's use case, I'll be happy.

But some people may expect a hard isolation guarantee (Real Time, deterministic
latency) and others softer isolation (HPC, only interested in performance, can
live with one rare random tick, so no need to loop before returning to userspace
until we have the no-noise guarantee).

I expect some Real Time users may want this kind of dataplane mode where a syscall
or whatever sleeps until the system is ready to provide the guarantee that no
disturbance is going to happen for a given time.  I'm not sure HPC users are
interested in that.

In fact this ties in with the fact that NO_HZ_FULL was really only supposed to be
about the tick, and now people are introducing more and more kernel default presets
that assume NO_HZ_FULL implies ISOLATION, which is about all kinds of noise (tick,
tasks, irqs, ...).  Which is true, but what kind of ISOLATION?

Probably NO_HZ_FULL should really only be about stopping the tick; then some sort
of CONFIG_ISOLATION would drive the kind of isolation we are interested in and
thereby the behaviour of NO_HZ_FULL, workqueues, timers, task affinity, irq
affinity, dataplane mode, ...

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 12:57             ` Steven Rostedt
  (?)
  (?)
@ 2015-05-11 17:19             ` Paul E. McKenney
  2015-05-11 17:27               ` Andrew Morton
  -1 siblings, 1 reply; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-11 17:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> 
> NO_HZ_LEAVE_ME_THE_FSCK_ALONE!

NO_HZ_OVERFLOWING?

Kconfig naming controversy aside, I believe this patchset is addressing
a real need.  Might need additional adjustment, but something useful.

							Thanx, Paul

> On Sat, 9 May 2015 09:05:38 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep 
> > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be 
> > such a 'zero overhead' mode of operation, where if user-space runs, it 
> > won't get interrupted in any way.
> 
> 
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.
> 
> Please lets get NO_HZ_FULL up to par. That should be the main focus.
> 
> -- Steve
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 17:19             ` Paul E. McKenney
@ 2015-05-11 17:27               ` Andrew Morton
  2015-05-11 17:33                 ` Frederic Weisbecker
  0 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2015-05-11 17:27 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > 
> > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> 
> NO_HZ_OVERFLOWING?

Actually, "NO_HZ" shouldn't appear in the name at all.  The objective
is to permit userspace to execute without interruption.  NO_HZ is a
part of that, as is NO_INTERRUPTS.  The "NO_HZ" thing is a historical
artifact from an early partial implementation.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 17:27               ` Andrew Morton
@ 2015-05-11 17:33                 ` Frederic Weisbecker
  2015-05-11 18:00                   ` Steven Rostedt
  0 siblings, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-05-11 17:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: paulmck, Steven Rostedt, Ingo Molnar, Chris Metcalf,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel

On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > > 
> > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> > 
> > NO_HZ_OVERFLOWING?
> 
> Actually, "NO_HZ" shouldn't appear in the name at all.  The objective
> is to permit userspace to execute without interruption.  NO_HZ is a
> part of that, as is NO_INTERRUPTS.  The "NO_HZ" thing is a historical
> artifact from an early partial implementation.

Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 17:33                 ` Frederic Weisbecker
@ 2015-05-11 18:00                   ` Steven Rostedt
  2015-05-11 18:09                       ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Steven Rostedt @ 2015-05-11 18:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, paulmck, Ingo Molnar, Chris Metcalf,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel

On Mon, 11 May 2015 19:33:06 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote:
> > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> > > > 
> > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE!
> > > 
> > > NO_HZ_OVERFLOWING?
> > 
> > Actually, "NO_HZ" shouldn't appear in the name at all.  The objective
> > is to permit userspace to execute without interruption.  NO_HZ is a
> > part of that, as is NO_INTERRUPTS.  The "NO_HZ" thing is a historical
> > artifact from an early partial implementation.
> 
> Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION.

Then we should have CONFIG_LEAVE_ME_THE_FSCK_ALONE.  Hmm, I guess that's
just a synonym for CONFIG_ISOLATION.

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 18:00                   ` Steven Rostedt
@ 2015-05-11 18:09                       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-11 18:09 UTC (permalink / raw)
  To: Steven Rostedt, Frederic Weisbecker
  Cc: Andrew Morton, paulmck, Ingo Molnar, Gilad Ben Yossef,
	Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api,
	linux-kernel

A bunch of issues have been raised by various folks (thanks!)  and
I'll try to break them down and respond to them in a few different
emails.  This email is just about the issue of naming and whether the
proposed patch series should even have its own "name" or just be part
of NO_HZ_FULL.

First, Ingo and Steven both suggested that this new "dataplane" mode
(or whatever we want to call it; see below) should just be rolled into
the existing NO_HZ_FULL and that we should focus on making that work
better.

Steven writes:
> All kidding aside, I think this is the real answer. We don't need a new
> NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> what it was created to do. That should be fixed.

The claim I'm making is that it's worthwhile to differentiate the two
semantics.  Plain NO_HZ_FULL just says "kernel makes a best effort to
avoid periodic interrupts without incurring any serious overhead".  My
patch series allows an app to request "kernel makes an absolute
commitment to avoid all interrupts regardless of cost when leaving
kernel space".  These are different enough ideas, and serve different
enough application needs, that I think they should be kept distinct.

Frederic actually summed this up very nicely in his recent email when
he wrote "some people may expect hard isolation requirement (Real
Time, deterministic latency) and others softer isolation (HPC, only
interested in performance, can live with one rare random tick, so no
need to loop before returning to userspace until we have the no-noise
guarantee)."

So we need a way for apps to ask for the "harder" mode and let
the softer mode be the default.
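
For reference, with the current series that request looks something like
the sketch below -- where every constant name is, of course, exactly
what's up for renaming:

#include <sys/prctl.h>

/*
 * Draft-series sketch (constants from this patch set's uapi additions,
 * not in any released kernel): enable dataplane mode, quiesce on each
 * return to userspace, and make any further kernel entry fatal.
 */
void request_hard_isolation(void)
{
        prctl(PR_SET_DATAPLANE,
              PR_DATAPLANE_ENABLE | PR_DATAPLANE_QUIESCE | PR_DATAPLANE_STRICT,
              0, 0, 0);
}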

What about naming?  We may or may not want to have a Kconfig flag
for this, and we may or may not have a separate mode for it, but
we still will need some kind of name to talk about it with.  (In
particular there's the prctl name, if we take that approach, and
potential boot command-line flags to consider naming for.)

I'll quickly cover the suggestions that have been raised:

- DATAPLANE.  My suggestion, seemingly broadly disliked by folks
   who felt it wasn't apparent what it meant.  Probably a fair point.

- NO_INTERRUPTS (Andrew).  Captures some of the sense, but was
   criticized pretty fairly by Ingo as being too negative, confusing
   with perf nomenclature, and too long :-)

- PURE (Ingo).  Proposed as an alternative to NO_HZ_FULL, but we could
   use it as a name for this new mode.  However, I think it's not clear
   enough how FULL and PURE can/should relate to each other from the
   names alone.

- BARE_METAL (me).  Ingo observes it's confusing with respect to
   virtualization.

- TASK_SOLO (Gilad).  Not sure this conveys enough of the semantics.

- OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE.  Excellent
   ideas :-)

- ISOLATION (Frederic).  I like this but it conflicts with other uses
   of "isolation" in the kernel: cgroup isolation, lru page isolation,
   iommu isolation, scheduler isolation (at least it's a superset of
   that one), etc.  Also, we're not exactly isolating a task - often
   a "dataplane" app consists of a bunch of interacting threads in
   userspace, so not exactly isolated.  So perhaps it's too confusing.

- OVERFLOWING (Paul) - not sure I understood this one, honestly.

I suggested earlier a few other candidates that I don't love, but no
one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.

One thing I'm leaning towards is to remove the intermediate state of
DATAPLANE_ENABLE and say that there is really only one primary state,
DATAPLANE_QUIESCE (or whatever we call it).  The "dataplane but no
quiesce" state probably isn't that useful, since it doesn't offer the
hard guarantee that is the entire point of this patch series.  So that
opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
word that describes the mode; of course this sort of conflicts with
RCU quiesce (though it is a superset of that so maybe that's OK).

One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
suggesting about "soft" and "hard" requirements for NO_HZ.  So
enabling NO_HZ_HARD would enable my suggested QUIESCE mode.

One way to focus this discussion is on the user API naming.  I had
prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
noun.  A lot of the other suggestions fail this test in various ways.
Reasonable candidates seem to be:

   PR_SET_OS_ZERO
   PR_SET_TASK_SOLO
   PR_SET_ISOLATION

Another possibility:

   PR_SET_NONSTOP

Or take Andrew's NO_INTERRUPTS and have:

   PR_SET_UNINTERRUPTED

I slightly favor ISOLATION at this point despite the overlap with
other kernel concepts.

Let the bike-shedding continue! :-)

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-11 18:36                         ` Steven Rostedt
  0 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-05-11 18:36 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Andrew Morton, paulmck, Ingo Molnar,
	Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Mon, 11 May 2015 14:09:59 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> Steven writes:
> > All kidding aside, I think this is the real answer. We don't need a new
> > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly
> > what it was created to do. That should be fixed.
> 
> The claim I'm making is that it's worthwhile to differentiate the two
> semantics.  Plain NO_HZ_FULL just says "kernel makes a best effort to
> avoid periodic interrupts without incurring any serious overhead".  My
> patch series allows an app to request "kernel makes an absolute
> commitment to avoid all interrupts regardless of cost when leaving
> kernel space".  These are different enough ideas, and serve different
> enough application needs, that I think they should be kept distinct.
> 
> Frederic actually summed this up very nicely in his recent email when
> he wrote "some people may expect hard isolation requirement (Real
> Time, deterministic latency) and others softer isolation (HPC, only
> interested in performance, can live with one rare random tick, so no
> need to loop before returning to userspace until we have the no-noise
> guarantee)."
> 
> So we need a way for apps to ask for the "harder" mode and let
> the softer mode be the default.

Fair enough. But I would hope that this would improve on NO_HZ_FULL as
well.

> 
> What about naming?  We may or may not want to have a Kconfig flag
> for this, and we may or may not have a separate mode for it, but
> we still will need some kind of name to talk about it with.  (In
> particular there's the prctl name, if we take that approach, and
> potential boot command-line flags to consider naming for.)
> 
> I'll quickly cover the suggestions that have been raised:
> 
> - DATAPLANE.  My suggestion, seemingly broadly disliked by folks
>    who felt it wasn't apparent what it meant.  Probably a fair point.
> 
> - NO_INTERRUPTS (Andrew).  Captures some of the sense, but was
>    criticized pretty fairly by Ingo as being too negative, confusing
>    with perf nomenclature, and too long :-)

What about NO_INTERRUPTIONS?

> 
> - PURE (Ingo).  Proposed as an alternative to NO_HZ_FULL, but we could
>    use it as a name for this new mode.  However, I think it's not clear
>    enough how FULL and PURE can/should relate to each other from the
>    names alone.

I would find the two confusing as well.

> 
> - BARE_METAL (me).  Ingo observes it's confusing with respect to
>    virtualization.

This is also confusing.

> 
> - TASK_SOLO (Gilad).  Not sure this conveys enough of the semantics.

Agreed.

> 
> - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE.  Excellent
>    ideas :-)

At least the LEAVE_ME_ALONE conveys the semantics ;-)

> 
> - ISOLATION (Frederic).  I like this but it conflicts with other uses
>    of "isolation" in the kernel: cgroup isolation, lru page isolation,
>    iommu isolation, scheduler isolation (at least it's a superset of
>    that one), etc.  Also, we're not exactly isolating a task - often
>    a "dataplane" app consists of a bunch of interacting threads in
>    userspace, so not exactly isolated.  So perhaps it's too confusing.
> 
> - OVERFLOWING (Steven) - not sure I understood this one, honestly.

Actually, that was suggested by Paul McKenney.

> 
> I suggested earlier a few other candidates that I don't love, but no
> one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD.
> 
> One thing I'm leaning towards is to remove the intermediate state of
> DATAPLANE_ENABLE and say that there is really only one primary state,
> DATAPLANE_QUIESCE (or whatever we call it).  The "dataplane but no
> quiesce" state probably isn't that useful, since it doesn't offer the
> hard guarantee that is the entire point of this patch series.  So that
> opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the
> word that describes the mode; of course this sort of conflicts with
> RCU quiesce (though it is a superset of that so maybe that's OK).
> 
> One new idea I had is to use NO_HZ_HARD to reflect what Frederic was
> suggesting about "soft" and "hard" requirements for NO_HZ.  So
> enabling NO_HZ_HARD would enable my suggested QUIESCE mode.
> 
> One way to focus this discussion is on the user API naming.  I had
> prctl(PR_SET_DATAPLANE), which was attractive in being a "positive"
> noun.  A lot of the other suggestions fail this test in various ways.
> Reasonable candidates seem to be:
> 
>    PR_SET_OS_ZERO
>    PR_SET_TASK_SOLO
>    PR_SET_ISOLATION
> 
> Another possibility:
> 
>    PR_SET_NONSTOP
> 
> Or take Andrew's NO_INTERRUPTS and have:
> 
>    PR_SET_UNINTERRUPTED

For another possible answer, what about 

	SET_TRANQUILITY

A state with no disturbances. 

-- Steve

> 
> I slightly favor ISOLATION at this point despite the overlap with
> other kernel concepts.
> 
> Let the bike-shedding continue! :-)
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
@ 2015-05-11 19:13       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-11 19:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel,
	linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner,
	Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter,
	Gilad Ben Yossef, Linux API

On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
> On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>> With QUIESCE mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves.  In particular,
>> if it enters the kernel via system call, page fault, or any of
>> a number of other synchronous traps, it may be unexpectedly
>> exposed to long latencies.  Add a simple flag that puts the process
>> into a state where any such kernel entry is fatal.
>>
>> To allow the state to be entered and exited, we add an internal
>> bit to current->dataplane_flags that is set when prctl() sets the
>> flags.  That way, when we are exiting the kernel after calling
>> prctl() to forbid future kernel exits, we don't get immediately
>> killed.
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.

Patch 6/6 proposes a mechanism to track down times when the
kernel screws up and delivers an IRQ to a userspace-only task.
Here, we're just trying to identify the times when an application
screws itself up out of cluelessness, and provide a mechanism
that allows the developer to easily figure out why and fix it.

In particular, /proc/interrupts won't show syscalls or page faults,
which are two easy ways applications can screw themselves
when they think they're in userspace-only mode.  Also, they don't
provide sufficient precision to make it clear what part of the
application caused the undesired kernel entry.

In this case, killing the task is appropriate, since that's exactly
the semantics that have been asked for - it's like on architectures
that don't natively support unaligned accesses, but fake it relatively
slowly in the kernel, and in development you just say "give me a
SIGBUS when that happens" and in production you might say
"fix it up and let's try to keep going".

You can argue that this is something that can be done by ftrace,
but certainly you'd want to have a way to programmatically
turn on ftrace at the moment when you're entering userspace-only
mode, so we'd want some API around that anyway.  And honestly,
it's so easy to test a task state bit in a couple of places and
generate the failure on the spot, vs. the relative complexity
of setting up and understanding ftrace, that I think it merits
inclusion on that basis alone.

> Also, can we please stop further torturing the exit paths?  We have a
> disaster of assembly code that calls into syscall_trace_leave and
> do_notify_resume.  Those functions, in turn, *both* call user_enter
> (WTF?), and on very brief inspection user_enter makes it into the nohz
> code through multiple levels of indirection, which, with these
> patches, has yet another conditionally enabled helper, which does this
> new stuff.  It's getting to be impossible to tell what happens when we
> exit to user space any more.
>
> Also, I think your code is buggy.  There's no particular guarantee
> that user_enter is only called once between sys_prctl and the final
> exit to user mode (see the above WTF), so you might spuriously kill
> the process.

This is a good point; I also find the x86 kernel entry and exit
paths confusing, although I've reviewed them a bunch of times.
The tile architecture paths are a little easier to understand.

That said, I think the answer here is avoid non-idempotent
actions in the dataplane code, such as clearing a syscall bit.

A better implementation, I think, is to put the tests for "you
screwed up and synchronously entered the kernel" in
the syscall_trace_enter() code, which TIF_NOHZ already
gets us into; there, we can test whether the dataplane "strict" bit is
set and the syscall is not prctl(), and if so generate the error.
(We'd exclude exit and exit_group here too, since we don't
need to shoot down a task that's just trying to kill itself.)
This needs a bit of platform-specific code for each platform,
but that doesn't seem like too big a problem.
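
As a sketch of the shape I have in mind, called from
syscall_trace_enter() (the helper name is invented, and I'm reusing
the series' dataplane_flags field; the real bit layout may differ):

    /* Sketch: kill a "strict" dataplane task on any syscall other
     * than prctl() and the exit family. */
    static void dataplane_strict_syscall(long syscall_nr)
    {
            if (!(current->dataplane_flags & PR_DATAPLANE_STRICT))
                    return;
            if (syscall_nr == __NR_prctl ||
                syscall_nr == __NR_exit ||
                syscall_nr == __NR_exit_group)
                    return;
            send_sig(SIGKILL, current, 1);
    }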

Likewise we can test in exception_enter(), since that's called for
all the synchronous user entries like page faults.

> Also, I think that most users will be quite surprised if "strict
> dataplane" code causes any machine check on the system to kill your
> dataplane task.

Fair point, and avoided by testing as described above instead.
(Though presumably in development it's not such a big deal,
and as I said you'd likely turn it off in production.)

> Similarly, a user accidentally running perf record -a
> probably should have some reasonable semantics.

Yes, also avoided by doing this as above, though I'd argue we
could also just say that running perf disables this mode.
But it's not as clean as the above suggestion.

On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote:
> So, I don't know if it is a practical suggestion or not, but would it be better/easier to mark a pending signal on kernel entry for this case?
> The upsides I see are that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully, since return to userspace with a pending signal is already handled, we don't need new code in the exit path?

We could certainly do this now that I'm planning to do the
test at kernel entry rather than super-late in kernel exit.
Rather than just do_group_exit(SIGKILL), we should raise
a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
and then we could catch it in the debugger; the pc should
help identify if it was a syscall, page fault, or other trap.

I'm not sure there's an argument to be made for the user
process being able to catch the signal itself; presumably in
production you don't turn this mode on anyway, and in
development, assuming a debugger is probably fine.

But if you want to argue for another signal (SIGILL?) please
do; I'm curious to hear if you think it would make more sense.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 15:36             ` Frederic Weisbecker
@ 2015-05-11 19:19               ` Mike Galbraith
  2015-05-11 19:25                   ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Mike Galbraith @ 2015-05-11 19:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Chris Metcalf,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, 2015-05-11 at 17:36 +0200, Frederic Weisbecker wrote:

> I expect some Real Time users may want this kind of dataplane mode where a syscall
> or whatever sleeps until the system is ready to provide the guarantee that no
> disturbance is going to happen for a given time. I'm not sure HPC users are interested
> in that.

I bet they are.  RT is just a different way to spell HPC, and vice versa.

> In fact it goes along the fact that NO_HZ_FULL was really only supposed to be about
> the tick and now people are introducing more and more kernel default presetting that
> assume NO_HZ_FULL implies ISOLATION which is about all kind of noise (tick, tasks, irqs,
> ...). Which is true but what kind of ISOLATION?

True, nohz mode and various isolation measures are distinct properties.
NO_HZ_FULL is kinda pointless without isolation measures to go with it,
but you're right.

I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
that old static isolcpus was _supposed_ to crawl off and die, I know
beyond doubt that having isolated a cpu as well as you can definitely
does NOT imply that said cpu should become tickless.  I routinely run a
load model that wants all the isolation it can get.  It's not
single-task compute though; it's an rt executive coordinating rt
workers, and of course it wants every cycle it can get, so nohz_full
is less than helpful.

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 19:19               ` Mike Galbraith
@ 2015-05-11 19:25                   ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-11 19:25 UTC (permalink / raw)
  To: Mike Galbraith, Frederic Weisbecker
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef,
	Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
> that old static isolcpus was _supposed_ to crawl off and die, I know
> beyond doubt that having isolated a cpu as well as you can definitely
> does NOT imply that said cpu should become tickless.

True, at a high level, I agree that it would be better to have a
top-level concept like Frederic's proposed ISOLATION that includes
isolcpus and nohz_cpu (and other stuff as needed).

That said, what you wrote above is wrong; even with the patch you
acked, setting isolcpus does not automatically turn on nohz_full for
a given cpu.  The patch made it true the other way around: when
you say nohz_full, you automatically get isolcpus on that cpu too.
That does, at least, make sense for the semantics of nohz_full.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-11 19:54               ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-11 19:54 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: Andrew Morton, Steven Rostedt, Gilad Ben Yossef, Peter Zijlstra,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, linux-doc, Linux API,
	linux-kernel

(Oops, resending and forcing html off.)

On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> Naming aside, I don't think this should be a per-task flag at all.  We
> already have way too much overhead per syscall in nohz mode, and it
> would be nice to get the per-syscall overhead as low as possible.  We
> should strive, for all tasks, to keep syscall overhead down *and*
> avoid as many interrupts as possible.
>
> That being said, I do see a legitimate use for a way to tell the
> kernel "I'm going to run in userspace for a long time; stay away".
> But shouldn't that be a single operation, not an ongoing flag?  IOW, I
> think that we should have a new syscall quiesce() or something rather
> than a prctl.

Yes, if all you are concerned about is quiescing the tick, we could
probably do it as a new syscall.

I do note that you'd want to try to actually do the quiesce as late as
possible - in particular, if you just did it in the usual syscall, you
might miss out on a timer that is set by softirq, or even something
that happened when you called schedule() on the syscall exit path.
Doing it as late as we are doing helps to ensure that that doesn't
happen.  We could still arrange for these semantics by having a new
quiesce() syscall set a temporary task bit that was cleared on
return to userspace, but as you pointed out in a different email,
that gets tricky if you end up doing multiple user_exit() calls on
your way back to userspace.

More to the point, I think it's actually important to know when an
application believes it's in userspace-only mode as an actual state
bit, rather than just during its transitional moment.  If an
application calls the kernel at an unexpected time (third-party code
is the usual culprit for our customers, whether it's syscalls, page
faults, or other things) we would prefer to have the "quiesce"
semantics stay in force and cause the third-party code to be
visibly very slow, rather than cause a totally unexpected and
hard-to-diagnose interrupt to show up later as we are still going
around the loop that we thought was safely userspace-only.

And, for debugging the kernel, it's crazy helpful to have that state
bit in place: see patch 6/6 in the series for how we can diagnose
things like "a different core just queued an IPI that will hit a
dataplane core unexpectedly".  Having that state bit makes this sort
of thing a trivial check in the kernel and relatively easy to debug.
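
For illustration, with the state bit in place the check can be as
simple as this (names invented here; patch 6/6 has the real
mechanism):

    /* Sketch: warn when an IPI is about to be queued for a cpu
     * whose current task has requested quiesce. */
    static void dataplane_debug_ipi(int cpu)
    {
            struct task_struct *p = cpu_curr(cpu);

            if (p->dataplane_flags & PR_DATAPLANE_QUIESCE)
                    pr_warn("IPI queued for dataplane cpu %d\n", cpu);
    }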

Finally, I proposed a "strict" mode in patch 5/6 where we kill the
process if it voluntarily enters the kernel by mistake after saying it
wasn't going to any more.  To do this requires a state bit, so
carrying another state bit for "quiesce on user entry" seems pretty
reasonable.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-09  7:04   ` Mike Galbraith
@ 2015-05-11 20:13     ` Chris Metcalf
  2015-05-12  2:21       ` Mike Galbraith
                         ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-11 20:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, linux-kernel

On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
>> For tasks which have elected dataplane functionality, we run
>> any pending softirqs for the core before returning to userspace,
>> rather than ever scheduling ksoftirqd to run.  The problem we
>> fix is that by allowing another task to run on the core, we
>> guarantee more interrupts in the future to the dataplane task,
>> which is exactly what dataplane mode is required to prevent.
> If ksoftirqd were rt class

I realize I actually don't know if this is true or not.  Is
ksoftirqd rt class?  If not, it does seem pretty plausible that
it should be...

> softirqs would be gone when the soloist gets
> the CPU back and heads to userspace.  Being a soloist, it has no use for
> a priority, so why can't it just let ksoftirqd run if it raises the
> occasional softirq?  Meeting a contended lock while processing it will
> wreck the soloist regardless of who does that processing.

The thing you want to avoid is having two processes both
runnable at once, since then the "quiesce" mode can't make
forward progress and basically spins in cpu_idle() until ksoftirqd
can come in.  Alas, my recollection of the precise failure mode
is somewhat dimmed; my commit notes from a year ago (for
a variant of the patch I'm upstreaming now):

         - Trying to return to userspace with pending softirqs is not
           currently allowed.  Prior to this patch, when this happened
           we would just wait in cpu_idle.  Instead, what we now do is
           directly run any pending softirqs, then go back and retry the
           path where we return to userspace.
         
         - Raising softirqs (in this case for hrtimer support) could
           cause the ksoftirqd daemon to be woken on a core.  This is
           bad because on a dataplane core, a QUIESCE process will
           then block until the ksoftirqd runs, and the system sometimes
           seems to flag that soft irqs are available but not schedule
           the timer to arrange for a context switch to ksoftirqd.
           To handle this, we avoid bailing out in __do_softirq() when
           we've been working for a while, if we're on a dataplane core,
           and just keep working until done.  Similarly, on a dataplane
           core running a userspace task, we don't wake ksoftirqd when
           we are raising a softirq, even if we're not in an interrupt
           context where it will run promptly, since a non-interrupt
           context will also run promptly.
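
The shape of the change, as a sketch (is_dataplane_cpu() is an
invented stand-in for the series' per-cpu test):

    /* Sketch: on a dataplane core, flush softirqs synchronously
     * on the exit-to-user path instead of waking ksoftirqd. */
    static void dataplane_flush_softirqs(void)
    {
            if (!is_dataplane_cpu(smp_processor_id()))
                    return;
            while (local_softirq_pending())
                    do_softirq();
    }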

I'm happy to drop this patch entirely from the series for now, and
if ksoftirqd shows up as a problem going forward, we can address it
as necessary at that time.   What do you think?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-11 22:15                 ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-11 22:15 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel,
	Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner,
	Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc,
	Christoph Lameter, Gilad Ben Yossef, Ingo Molnar

On May 12, 2015 4:54 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>
> (Oops, resending and forcing html off.)
>
>
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
>>
>> Naming aside, I don't think this should be a per-task flag at all.  We
>> already have way too much overhead per syscall in nohz mode, and it
>> would be nice to get the per-syscall overhead as low as possible.  We
>> should strive, for all tasks, to keep syscall overhead down *and*
>> avoid as many interrupts as possible.
>>
>> That being said, I do see a legitimate use for a way to tell the
>> kernel "I'm going to run in userspace for a long time; stay away".
>> But shouldn't that be a single operation, not an ongoing flag?  IOW, I
>> think that we should have a new syscall quiesce() or something rather
>> than a prctl.
>
>
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
>
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen.  We could still arrange for these semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.

We should fix that, then.  A quiesce() syscall can certainly arrange
to clean up on final exit.

>
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment.

We can do that, too, with a new flag that's cleared on the next entry.

>  If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt to show up later as we are still going
> around the loop that we thought was safely userspace-only.

I'm not really convinced that we should design this feature around
ease of debugging userspace screwups.  There are already plenty of
ways to do that part.  Userspace getting an interrupt because
userspace accidentally did a syscall is very different from userspace
getting interrupted due to an IPI.

>
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly".  Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

As above, this can be done with a one-time operation, too.

>
> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more.  To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.

I still dislike that in the form you chose.  It's too deadly to be
useful for anyone but the hardest RT users.

I think I'd be okay with variants, though: let a suitably privileged
process ask for a signal on inadvertent kernel entry or rig up an fd
to be notified when one of these bad entries happens.  Queueing
something to a pollable fd would work, too.

See that thread for more comments.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-11 19:13       ` Chris Metcalf
  (?)
@ 2015-05-11 22:28       ` Andy Lutomirski
  2015-05-12 21:06         ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-11 22:28 UTC (permalink / raw)
  To: Chris Metcalf, Peter Zijlstra
  Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel,
	linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner,
	Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef,
	Linux API

[add peterz due to perf stuff]

On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
>>
>> On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>>>
>>> With QUIESCE mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves.  In particular,
>>> if it enters the kernel via system call, page fault, or any of
>>> a number of other synchronous traps, it may be unexpectedly
>>> exposed to long latencies.  Add a simple flag that puts the process
>>> into a state where any such kernel entry is fatal.
>>>
>>> To allow the state to be entered and exited, we add an internal
>>> bit to current->dataplane_flags that is set when prctl() sets the
>>> flags.  That way, when we are exiting the kernel after calling
>>> prctl() to forbid future kernel exits, we don't get immediately
>>> killed.
>>
>> Is there any reason this can't already be addressed in userspace using
>> /proc/interrupts or perf_events?  ISTM the real goal here is to detect
>> when we screw up and fail to avoid an interrupt, and killing the task
>> seems like overkill to me.
>
>
> Patch 6/6 proposes a mechanism to track down times when the
> kernel screws up and delivers an IRQ to a userspace-only task.
> Here, we're just trying to identify the times when an application
> screws itself up out of cluelessness, and provide a mechanism
> that allows the developer to easily figure out why and fix it.
>
> In particular, /proc/interrupts won't show syscalls or page faults,
> which are two easy ways applications can screw themselves
> when they think they're in userspace-only mode.  Also, they don't
> provide sufficient precision to make it clear what part of the
> application caused the undesired kernel entry.

Perf does, though, complete with context.

>
> In this case, killing the task is appropriate, since that's exactly
> the semantics that have been asked for - it's like on architectures
> that don't natively support unaligned accesses, but fake it relatively
> slowly in the kernel, and in development you just say "give me a
> SIGBUS when that happens" and in production you might say
> "fix it up and let's try to keep going".

I think more control is needed.  I also think that, if we go this
route, we should distinguish syscalls, synchronous non-syscall
entries, and asynchronous non-syscall entries.  They're quite
different.

>
> You can argue that this is something that can be done by ftrace,
> but certainly you'd want to have a way to programmatically
> turn on ftrace at the moment when you're entering userspace-only
> mode, so we'd want some API around that anyway.  And honestly,
> it's so easy to test a task state bit in a couple of places and
> generate the failurel on the spot, vs. the relative complexity
> of setting up and understanding ftrace, that I think it merits
> inclusion on that basis alone.

perf_event, not ftrace.

>
>> Also, can we please stop further torturing the exit paths?  We have a
>> disaster of assembly code that calls into syscall_trace_leave and
>> do_notify_resume.  Those functions, in turn, *both* call user_enter
>> (WTF?), and on very brief inspection user_enter makes it into the nohz
>> code through multiple levels of indirection, which, with these
>> patches, has yet another conditionally enabled helper, which does this
>> new stuff.  It's getting to be impossible to tell what happens when we
>> exit to user space any more.
>>
>> Also, I think your code is buggy.  There's no particular guarantee
>> that user_enter is only called once between sys_prctl and the final
>> exit to user mode (see the above WTF), so you might spuriously kill
>> the process.
>
>
> This is a good point; I also find the x86 kernel entry and exit
> paths confusing, although I've reviewed them a bunch of times.
> The tile architecture paths are a little easier to understand.
>
> That said, I think the answer here is avoid non-idempotent
> actions in the dataplane code, such as clearing a syscall bit.
>
> A better implementation, I think, is to put the tests for "you
> screwed up and synchronously entered the kernel" in
> the syscall_trace_enter() code, which TIF_NOHZ already
> gets us into;

No, not unless you're planning on using that to distinguish syscalls
from other stuff *and* people think that's justified.

It's far too easy to just make a tiny change to the entry code.  Add a
tiny trivial change here, a few lines of asm (that's you, audit!)
there, some weird written-in-asm scheduling code over here, and you
end up with the truly awful mess that we currently have.

If it really makes sense for this stuff to go with context tracking,
then fine, but we should *fix* the context tracking first rather than
kludging around it.  I already have a prototype patch for the relevant
part of that.

> there, we can test if the dataplane "strict" bit is
> set and the syscall is not prctl(), then we generate the error.
> (We'd exclude exit and exit_group here too, since we don't
> need to shoot down a task that's just trying to kill itself.)
> This needs a bit of platform-specific code for each platform,
> but that doesn't seem like too big a problem.

I'd rather avoid that, too.  This feature isn't really arch-specific,
so let's avoid the arch stuff if at all possible.

>
> Likewise we can test in exception_enter() since that's only
> called for all the synchronous user entries like page faults.

Let's try to generalize a bit.  There's also irq_entry and ist_enter,
and some of the exception_enter cases are for synchronous entries
while (IIRC -- could be wrong) others aren't always like that.

>
>> Also, I think that most users will be quite surprised if "strict
>> dataplane" code causes any machine check on the system to kill your
>> dataplane task.
>
>
> Fair point, and avoided by testing as described above instead.
> (Though presumably in development it's not such a big deal,
> and as I said you'd likely turn it off in production.)

Until you forget to turn it off in production because it worked so
nicely in development.

What if we added a mode to perf where delivery of a sample
synchronously (or semi-synchronously by catching it on the next exit
to userspace) freezes the delivering task?  It would be like debugger
support via perf.

peterz, do you think this would be a sensible thing to add to perf?
It would only make sense for some types of events (tracepoints and
hw_breakpoints mostly, I think).

>> So, I don't know if it is a practical suggestion or not, but would it be
>> better/easier to mark a pending signal on kernel entry for this case?
>> The upside I see is that the user gets her notification (killing the task
>> or just logging the event in a signal handler) and, hopefully, since return to
>> userspace with a pending signal is already handled, we don't need new code in
>> the exit path?
>
>
> We could certainly do this now that I'm planning to do the
> test at kernel entry rather than super-late in kernel exit.
> Rather than just do_group_exit(SIGKILL), we should raise
> a proper SIGKILL signal via send_sig(SIGKILL, current, 1),
> and then we could catch it in the debugger; the pc should
> help identify if it was a syscall, page fault, or other trap.
>
> I'm not sure there's an argument to be made for the user
> process being able to catch the signal itself; presumably in
> production you don't turn this mode on anyway, and in
> development, assuming a debugger is probably fine.
>
> But if you want to argue for another signal (SIGILL?) please
> do; I'm curious to hear if you think it would make more sense.

Make it configurable as part of the prctl.
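
E.g. something like this, where the flag layout (and the
PR_SET_DATAPLANE value itself) is entirely made up for the sketch:

#include <signal.h>
#include <sys/prctl.h>

#define PR_SET_DATAPLANE	47	/* value made up for this sketch */
#define PR_DATAPLANE_STRICT	0x2	/* ditto */
#define PR_DATAPLANE_SIG(sig)	((sig) << 8)	/* pack the chosen signal */

int main(void)
{
	/* Strict mode, but deliver a catchable SIGUSR1 instead of SIGKILL. */
	return prctl(PR_SET_DATAPLANE,
		     PR_DATAPLANE_STRICT | PR_DATAPLANE_SIG(SIGUSR1), 0, 0, 0);
}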

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-11 19:25                   ` Chris Metcalf
  (?)
@ 2015-05-12  1:47                   ` Mike Galbraith
  2015-05-12  4:35                     ` Mike Galbraith
  2015-05-15 15:05                     ` Chris Metcalf
  -1 siblings, 2 replies; 340+ messages in thread
From: Mike Galbraith @ 2015-05-12  1:47 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > I really shouldn't have acked nohz_full -> isolcpus.  Besides the fact
> > that old static isolcpus was _supposed_ to crawl off and die, I know
> > beyond doubt that having isolated a cpu as well as you can definitely
> > does NOT imply that said cpu should become tickless.
> 
> True, at a high level, I agree that it would be better to have a
> top-level concept like Frederic's proposed ISOLATION that includes
> isolcpus and nohz_cpu (and other stuff as needed).
> 
> That said, what you wrote above is wrong; even with the patch you
> acked, setting isolcpus does not automatically turn on nohz_full for
> a given cpu.  The patch made it true the other way around: when
> you say nohz_full, you automatically get isolcpus on that cpu too.
> That does, at least, make sense for the semantics of nohz_full.

I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
Yes, with nohz_full currently being static, the old allegedly dying but
also static isolcpus scheduler off switch is a convenient thing to wire
the nohz_full CPU SET (<- hint;) property to.

	-Mike



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-11 20:13     ` Chris Metcalf
@ 2015-05-12  2:21       ` Mike Galbraith
  2015-05-12  9:28       ` Peter Zijlstra
  2015-05-12  9:32       ` Peter Zijlstra
  2 siblings, 0 replies; 340+ messages in thread
From: Mike Galbraith @ 2015-05-12  2:21 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, linux-kernel

On Mon, 2015-05-11 at 16:13 -0400, Chris Metcalf wrote:
> On 05/09/2015 03:04 AM, Mike Galbraith wrote:
> > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> >> For tasks which have elected dataplane functionality, we run
> >> any pending softirqs for the core before returning to userspace,
> >> rather than ever scheduling ksoftirqd to run.  The problem we
> >> fix is that by allowing another task to run on the core, we
> >> guarantee more interrupts in the future to the dataplane task,
> >> which is exactly what dataplane mode is required to prevent.
> > If ksoftirqd were rt class
> 
> I realize I actually don't know if this is true or not.  Is
> ksoftirqd rt class?  If not, it does seem pretty plausible that
> it should be...

It is in an rt kernel, not in a stock kernel, it's malleable in both ;-)

> > softirqs would be gone when the soloist gets
> > the CPU back and heads to userspace.  Being a soloist, it has no use for
> > a priority, so why can't it just let ksoftirqd run if it raises the
> > occasional softirq?  Meeting a contended lock while processing it will
> > wreck the soloist regardless of who does that processing.
> 
> The thing you want to avoid is having two processes both
> runnable at once, since then the "quiesce" mode can't make
> forward progress and basically spins in cpu_idle() until ksoftirqd
> can come in.

The only way ksoftirqd can appear is if the soloist woke it.  If the alleged
soloist is raising enough softirqs to matter, it ain't really an ultra-sensitive
solo artist, it's part of a noise-inducing (locks) chorus.

>   Alas, my recollection of the precise failure mode
> is somewhat dimmed; my commit notes from a year ago (for
> a variant of the patch I'm upstreaming now):
> 
>          - Trying to return to userspace with pending softirqs is not
>            currently allowed.  Prior to this patch, when this happened
>            we would just wait in cpu_idle.  Instead, what we now do is
>            directly run any pending softirqs, then go back and retry the
>            path where we return to userspace.
>          
>          - Raising softirqs (in this case for hrtimer support) could
>            cause the ksoftirqd daemon to be woken on a core.  This is
>            bad because on a dataplane core, a QUIESCE process will
>            then block until the ksoftirqd runs, and the system sometimes
>            seems to flag that soft irqs are available but not schedule
>            the timer to arrange for a context switch to ksoftirqd.
>            To handle this, we avoid bailing out in __do_softirq() when
>            we've been working for a while, if we're on a dataplane core,
>            and just keep working until done.  Similarly, on a dataplane
>            core running a userspace task, we don't wake ksoftirqd when
>            we are raising a softirq, even if we're not in an interrupt
>            context where it will run promptly, since a non-interrupt
>            context will also run promptly.
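
In code, the no-wakeup half of that description amounts to roughly the
following sketch against kernel/softirq.c, where is_dataplane_cpu() is
a hypothetical stand-in for the series' real per-task test:

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * A softirq raised outside of interrupt context normally has to
	 * wake ksoftirqd.  On a dataplane core running a userspace task
	 * we skip the wakeup: the pending softirq is instead run
	 * synchronously on the return-to-user path, so nr_running never
	 * exceeds 1 and the quiesce loop can make forward progress.
	 */
	if (!in_interrupt() && !is_dataplane_cpu(smp_processor_id()))
		wakeup_softirqd();
}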

Thomas has nuked the hrtimer softirq.

> I'm happy to drop this patch entirely from the series for now, and
> if ksoftirqd shows up as a problem going forward, we can address it
> as necessary at that time.   What do you think?

Inlining softirqs may save a context switch, but adds cycles that we may
consume at higher frequency than the thing we're avoiding.

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-12  1:47                   ` Mike Galbraith
@ 2015-05-12  4:35                     ` Mike Galbraith
  2015-05-15 15:05                     ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Mike Galbraith @ 2015-05-12  4:35 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> > On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> > > I really shouldn't have acked nohz_full -> isolcpus.  Besides the fact
> > > that old static isolcpus was _supposed_ to crawl off and die, I know
> > > beyond doubt that having isolated a cpu as well as you can definitely
> > > does NOT imply that said cpu should become tickless.
> > 
> > True, at a high level, I agree that it would be better to have a
> > top-level concept like Frederic's proposed ISOLATION that includes
> > isolcpus and nohz_cpu (and other stuff as needed).
> > 
> > That said, what you wrote above is wrong; even with the patch you
> > acked, setting isolcpus does not automatically turn on nohz_full for
> > a given cpu.  The patch made it true the other way around: when
> > you say nohz_full, you automatically get isolcpus on that cpu too.
> > That does, at least, make sense for the semantics of nohz_full.
> 
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

BTW, another facet of this: Rik wants to make isolcpus immune to
cpusets, which makes some sense, the user did say isolcpus=, but that also
makes isolcpus truly static.  If the user now says nohz_full=, they lose
the ability to deactivate CPU isolation, making the set fairly useless
for anything other than HPC.  Currently, the user can flip the isolation
switch as he sees fit.  He takes a size extra large performance hit for
having said nohz_full=, but he doesn't lose generic utility.

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
  2015-05-11 18:09                       ` Chris Metcalf
  (?)
  (?)
@ 2015-05-12  9:10                       ` Ingo Molnar
  2015-05-12 11:48                           ` Peter Zijlstra
  2015-05-12 21:05                           ` CONFIG_ISOLATION=y Chris Metcalf
  -1 siblings, 2 replies; 340+ messages in thread
From: Ingo Molnar @ 2015-05-12  9:10 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck,
	Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel


* Chris Metcalf <cmetcalf@ezchip.com> wrote:

> - ISOLATION (Frederic).  I like this but it conflicts with other uses
>   of "isolation" in the kernel: cgroup isolation, lru page isolation,
>   iommu isolation, scheduler isolation (at least it's a superset of
>   that one), etc.  Also, we're not exactly isolating a task - often
>   a "dataplane" app consists of a bunch of interacting threads in
>   userspace, so not exactly isolated.  So perhaps it's too confusing.

So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is 
a high level kernel feature, so it won't conflict with isolation 
concepts in lower level subsystems such as IOMMU isolation - and other 
higher level features like scheduler isolation are basically another 
partial implementation we want to merge with all this...

nohz, RCU tricks, watchdog defaults, isolcpus and various other 
measures to keep these CPUs and workloads as isolated as possible
are (or should become) components of this high level concept.

Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost 
zero overhead on normal workloads and on non-isolated CPUs, so that 
Linux distributions can enable it.

Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step 
needed: just like cpusets, the configuration of isolated CPUs should 
be a completely boot-option-free exercise that can be dynamically 
done and undone by the administrator via an intuitive interface.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  2015-05-08 17:58 ` [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane Chris Metcalf
@ 2015-05-12  9:26   ` Peter Zijlstra
  2015-05-12 13:12     ` Paul E. McKenney
  0 siblings, 1 reply; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12  9:26 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat,
	linux-kernel

On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> While the current fallback to 1-second tick is still helpful for
> maintaining completely correct kernel semantics, processes using
> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> completely tickless, so don't bound the time_delta for such processes.
> 
> This was previously discussed in
> 
> https://lkml.org/lkml/2014/10/31/364
> 
> and Thomas Gleixner observed that vruntime, load balancing data,
> load accounting, and other things might be impacted.  Frederic
> Weisbecker similarly observed that allowing the tick to be indefinitely
> deferred just meant that no one would ever fix the underlying bugs.
> However it's at least true that the mode proposed in this patch can
> only be enabled on an isolcpus core, which may limit how important
> it is to maintain scheduler data correctly, for example.

So how is making this available going to help people fix the actual
problem?

There is nothing fundamentally impossible about fixing this properly, it's
just a lot of hard work.

NAK on this, do it right.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-11 20:13     ` Chris Metcalf
  2015-05-12  2:21       ` Mike Galbraith
@ 2015-05-12  9:28       ` Peter Zijlstra
  2015-05-12  9:32       ` Peter Zijlstra
  2 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12  9:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Mike Galbraith, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, linux-kernel

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
>         - Raising softirqs (in this case for hrtimer support) could

Note that Thomas recently killed all the softirq wreckage in hrtimers.
So that specific case is dealt with.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-11 20:13     ` Chris Metcalf
  2015-05-12  2:21       ` Mike Galbraith
  2015-05-12  9:28       ` Peter Zijlstra
@ 2015-05-12  9:32       ` Peter Zijlstra
  2015-05-12 13:08         ` Paul E. McKenney
  2 siblings, 1 reply; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12  9:32 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Mike Galbraith, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, linux-kernel

On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> The thing you want to avoid is having two processes both
> runnable at once

Right, because as soon as nr_running > 1 we kill the entire nohz_full
thing. RT or not for ksoftirqd doesn't matter.

Then again, like interrupts, you basically want to avoid softirqs in
this mode.

So I think the right solution is to figure out why the softirqs get
raised and cure that.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
@ 2015-05-12  9:33     ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12  9:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> kernel to quiesce any pending timer interrupts prior to returning
> to userspace.  When running with this mode set, sys calls (and page
> faults, etc.) can be inordinately slow.  However, user applications
> that want to guarantee that no unexpected interrupts will occur
> (even if they call into the kernel) can set this flag to guarantee
> those semantics.

Currently people hot-unplug and hot-plug the CPU to do this. Obviously
that's a wee bit horrible :-)

Not sure if a prctl like this is any better though. This is a CPU
property, not a process one.

ISTR people talking about a 'quiesce' sysfs file alongside the hotplug
stuff, but I can't quite remember.



^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-08 17:58   ` Chris Metcalf
  (?)
  (?)
@ 2015-05-12  9:38   ` Peter Zijlstra
  2015-05-12 13:20       ` Paul E. McKenney
  -1 siblings, 1 reply; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12  9:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> +++ b/kernel/time/tick-sched.c
> @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
>  			(jiffies - start));
>  		dump_stack();
>  	}
> +
> +	/*
> +	 * Kill the process if it violates STRICT mode.  Note that this
> +	 * code also results in killing the task if a kernel bug causes an
> +	 * irq to be delivered to this core.
> +	 */
> +	if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> +	    == PR_DATAPLANE_STRICT) {
> +		pr_warn("Dataplane STRICT mode violated; process killed.\n");
> +		dump_stack();
> +		task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> +		local_irq_enable();
> +		do_group_exit(SIGKILL);
> +	}
>  }

So while I'm all for hard fails like this, can we not provide a wee bit
more information in the siginfo ? And maybe use a slightly less fatal
signal, such that userspace can actually catch it and dump state in
debug modes?
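
I.e. something like the below instead of the bare do_group_exit()
(a sketch; the signal choice and the siginfo fields are illustrative
only):

static void dataplane_strict_violation(void)
{
	struct siginfo info;

	memset(&info, 0, sizeof(info));
	info.si_signo = SIGXCPU;	/* catchable, unlike SIGKILL */
	info.si_code  = SI_KERNEL;
	/* Give a debug-mode handler a hint about where we came from. */
	info.si_addr  = (void __user *)instruction_pointer(current_pt_regs());

	pr_warn("Dataplane STRICT mode violated by %s/%d\n",
		current->comm, task_pid_nr(current));
	force_sig_info(SIGXCPU, &info, current);
}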

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
@ 2015-05-12  9:50       ` Ingo Molnar
  0 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2015-05-12  9:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > kernel to quiesce any pending timer interrupts prior to returning
> > to userspace.  When running with this mode set, sys calls (and page
> > faults, etc.) can be inordinately slow.  However, user applications
> > that want to guarantee that no unexpected interrupts will occur
> > (even if they call into the kernel) can set this flag to guarantee
> > those semantics.
> 
> Currently people hot-unplug and hot-plug the CPU to do this. 
> Obviously that's a wee bit horrible :-)
> 
> Not sure if a prctl like this is any better though. This is a CPU 
> property, not a process one.

So then a prctl() (or other system call) could be a shortcut to:

 - move the task to an isolated CPU
 - make sure there _is_ such an isolated domain available

I.e. have some programmatic, kernel provided way for an application to 
be sure it's running in the right environment. Relying on random 
administration flags here and there won't cut it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
@ 2015-05-12 10:38         ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12 10:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Tue, May 12, 2015 at 11:50:30AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
> > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
> > > kernel to quiesce any pending timer interrupts prior to returning
> > > to userspace.  When running with this mode set, sys calls (and page
> > > faults, etc.) can be inordinately slow.  However, user applications
> > > that want to guarantee that no unexpected interrupts will occur
> > > (even if they call into the kernel) can set this flag to guarantee
> > > those semantics.
> > 
> > Currently people hot-unplug and hot-plug the CPU to do this. 
> > Obviously that's a wee bit horrible :-)
> > 
> > Not sure if a prctl like this is any better though. This is a CPU 
> > property, not a process one.
> 
> So then a prctl() (or other system call) could be a shortcut to:
> 
>  - move the task to an isolated CPU
>  - make sure there _is_ such an isolated domain available
> 
> I.e. have some programmatic, kernel provided way for an application to 
> be sure it's running in the right environment. Relying on random 
> administration flags here and there won't cut it.

No, we already have sched_setaffinity() and we should not duplicate its
ability to move tasks about.

What this is about is 'clearing' CPU state, it's nothing to do with
tasks.

Ideally we'd never have to clear the state because it should be
impossible to get into this predicament in the first place.

The typical example here is a periodic timer that found its way onto the
cpu and stays there. We're actually working on allowing such self arming
timers to migrate, so once we have that sorted this could be fixed
proper I think.

Not sure if there's more pollution that people worry about.

The hotplug hack worked because unplug force migrates the timers away.

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
@ 2015-05-12 10:46               ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12 10:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef,
	Ingo Molnar, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel

On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
> 
> Please let's get NO_HZ_FULL up to par. That should be the main focus.
> 

ACK, much of this dataplane stuff is a set of (useful) hacks working
around the fact that nohz_full just isn't complete.



^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
@ 2015-05-12 11:48                           ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12 11:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker,
	Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel

On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> 
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is 
> a high level kernel feature, so it won't conflict with isolation 
> concepts in lower level subsystems such as IOMMU isolation - and other 
> higher level features like scheduler isolation are basically another 
> partial implementation we want to merge with all this...
> 

But why do we need a CONFIG flag for something that has no content?

That is, I do not see anything much, except the 'I want to stay in
userspace and kill me otherwise' flag, and I'm not sure that warrants a
CONFIG flag like this.

Other than that, it's all a combination of NOHZ_FULL and cpusets/isolcpus
and whatnot.

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
  2015-05-12 11:48                           ` Peter Zijlstra
  (?)
@ 2015-05-12 12:34                           ` Ingo Molnar
  2015-05-12 12:39                               ` Peter Zijlstra
  2015-05-12 15:36                             ` Frederic Weisbecker
  -1 siblings, 2 replies; 340+ messages in thread
From: Ingo Molnar @ 2015-05-12 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker,
	Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> > 
> > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this 
> > is a high level kernel feature, so it won't conflict with 
> > isolation concepts in lower level subsystems such as IOMMU 
> > isolation - and other higher level features like scheduler 
> > isolation are basically another partial implementation we want to 
> > merge with all this...
> 
> But why do we need a CONFIG flag for something that has no content?
> 
> That is, I do not see anything much, except the 'I want to stay in 
> userspace and kill me otherwise' flag, and I'm not sure that 
> warrants a CONFIG flag like this.
> 
> Other than that, it's all a combination of NOHZ_FULL and 
> cpusets/isolcpus and whatnot.

Yes, that's what I meant: CONFIG_ISOLATION would trigger what is 
NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as 
an individual Kconfig option?

CONFIG_ISOLATION=y would express the guarantee from the kernel that 
it's possible for user-space to configure itself to run undisturbed - 
instead of the current inconsistent set of options and facilities.

A bit like CONFIG_PREEMPT_RT is more than just preemptible spinlocks, 
it also tries to offer various facilities and tune the defaults to 
turn the kernel hard-rt.

Does that make sense to you?

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
@ 2015-05-12 12:39                               ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-05-12 12:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker,
	Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is 
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as 
> an individual Kconfig option?

Ah, as a rename of nohz_full, sure that might work.

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
@ 2015-05-12 12:43                                 ` Ingo Molnar
  0 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2015-05-12 12:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker,
	Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
>
> > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is 
> > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL 
> > as an individual Kconfig option?
> 
> Ah, as a rename of nohz_full, sure that might work.

It could also be named CONFIG_CPU_ISOLATION=y, to make it more 
explicit what it's about.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-12 10:38         ` Peter Zijlstra
  (?)
@ 2015-05-12 12:52         ` Ingo Molnar
  2015-05-13  4:35           ` Andy Lutomirski
  -1 siblings, 1 reply; 340+ messages in thread
From: Ingo Molnar @ 2015-05-12 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> > So then a prctl() (or other system call) could be a shortcut 
> > to:
> > 
> >  - move the task to an isolated CPU
> >  - make sure there _is_ such an isolated domain available
> > 
> > I.e. have some programmatic, kernel provided way for an 
> > application to be sure it's running in the right environment. 
> > Relying on random administration flags here and there won't cut 
> > it.
> 
> No, we already have sched_setaffinity() and we should not duplicate 
> its ability to move tasks about.

But sched_setaffinity() does not guarantee isolation - it's just a 
syscall to move a task to a set of CPUs, which might be isolated or 
not.

What I suggested is that it might make sense to offer a system call, 
for example a sched_setparam() variant, that makes such guarantees.

Say if user-space does:

	ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);

... then we would get the task moved to an isolated domain and get a 0 
return code if the kernel is able to do all that and if the current 
uid/namespace/etc. has the required permissions and such.

( BIND_ISOLATED will not replace the current p->policy value, so it's
  still possible to use the regular policies as well on top of this. )

I.e. make it programmatic instead of relying on a fragile, kernel 
version dependent combination of sysctl, sysfs, kernel config and boot 
parameter details to get us this result.

I.e. provide a central hub to offer this feature in a more structured, 
easier to use fashion.

We might still require the admin (or distro) to separately set up the 
domain of isolated CPUs, and it would still be possible to simply 
'move' tasks there using existing syscalls - but I say that it's not a 
bad idea at all to offer a single central syscall interface for apps 
to request such treatment.
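
From the application side that would look something like the below,
where BIND_ISOLATED and its value are of course hypothetical at this
point:

#include <sched.h>
#include <stdio.h>

/* Hypothetical flag, OR'd into the policy so p->policy is preserved. */
#define BIND_ISOLATED	0x40000000

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_OTHER | BIND_ISOLATED, &sp)) {
		/* No isolated domain available, or no permission. */
		perror("BIND_ISOLATED");
		return 1;
	}
	/* From here on we are guaranteed to run on an isolated CPU. */
	return 0;
}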

> What this is about is 'clearing' CPU state, it's nothing to do with 
> tasks.
> 
> Ideally we'd never have to clear the state because it should be 
> impossible to get into this predicament in the first place.

That I absolutely agree about, that bit is nonsense.

We might offer debugging facilities to debug such bugs, but we won't 
work around or hack around it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  2015-05-12  9:32       ` Peter Zijlstra
@ 2015-05-12 13:08         ` Paul E. McKenney
  0 siblings, 0 replies; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-12 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Mike Galbraith, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Christoph Lameter, linux-kernel

On Tue, May 12, 2015 at 11:32:02AM +0200, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote:
> > The thing you want to avoid is having two processes both
> > runnable at once
> 
> Right, because as soon as nr_running > 1 we kill the entire nohz_full
> thing. RT or not for ksoftirqd doesn't matter.
> 
> Then again, like interrupts, you basically want to avoid softirqs in
> this mode.
> 
> So I think the right solution is to figure out why the softirqs get
> raised and cure that.

Makes sense, but it also makes sense to have something that detects
when that cure fails and cleans up.  And, in a test/debug environment,
it should also issue some sort of diagnostic in that case.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  2015-05-12  9:26   ` Peter Zijlstra
@ 2015-05-12 13:12     ` Paul E. McKenney
  2015-05-14 20:55       ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-12 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat,
	linux-kernel

On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> > While the current fallback to 1-second tick is still helpful for
> > maintaining completely correct kernel semantics, processes using
> > prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> > completely tickless, so don't bound the time_delta for such processes.
> > 
> > This was previously discussed in
> > 
> > https://lkml.org/lkml/2014/10/31/364
> > 
> > and Thomas Gleixner observed that vruntime, load balancing data,
> > load accounting, and other things might be impacted.  Frederic
> > Weisbecker similarly observed that allowing the tick to be indefinitely
> > deferred just meant that no one would ever fix the underlying bugs.
> > However it's at least true that the mode proposed in this patch can
> > only be enabled on an isolcpus core, which may limit how important
> > it is to maintain scheduler data correctly, for example.
> 
> So how is making this available going to help people fix the actual
> problem?

It will at least provide an environment where adding more of this
problem might get punished.  This would be an improvement over what
we have today, namely that the 1HZ fallback timer silently forgives
adding more problems of this sort.

							Thanx, Paul

> There is nothing fundamentally impossible about fixing this properly, it's
> just a lot of hard work.
> 
> NAK on this, do it right.
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
       [not found]             ` <55510885.9070101@ezchip.com>
@ 2015-05-12 13:18               ` Paul E. McKenney
  0 siblings, 0 replies; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-12 13:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Andy Lutomirski, Ingo Molnar, Andrew Morton, Steven Rostedt,
	Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, Linux API, linux-kernel

On Mon, May 11, 2015 at 03:52:37PM -0400, Chris Metcalf wrote:
> On 05/09/2015 03:19 AM, Andy Lutomirski wrote:
> >Naming aside, I don't think this should be a per-task flag at all.  We
> >already have way too much overhead per syscall in nohz mode, and it
> >would be nice to get the per-syscall overhead as low as possible.  We
> >should strive, for all tasks, to keep syscall overhead down*and*
> >avoid as many interrupts as possible.
> >
> >That being said, I do see a legitimate use for a way to tell the
> >kernel "I'm going to run in userspace for a long time; stay away".
> >But shouldn't that be a single operation, not an ongoing flag?  IOW, I
> >think that we should have a new syscall quiesce() or something rather
> >than a prctl.
> 
> Yes, if all you are concerned about is quiescing the tick, we could
> probably do it as a new syscall.
> 
> I do note that you'd want to try to actually do the quiesce as late as
> possible - in particular, if you just did it in the usual syscall, you
> might miss out on a timer that is set by softirq, or even something
> that happened when you called schedule() on the syscall exit path.
> Doing it as late as we are doing helps to ensure that that doesn't
> happen.  We could still arrange for this semantics by having a new
> quiesce() syscall set a temporary task bit that was cleared on
> return to userspace, but as you pointed out in a different email,
> that gets tricky if you end up doing multiple user_exit() calls on
> your way back to userspace.
> 
> More to the point, I think it's actually important to know when an
> application believes it's in userspace-only mode as an actual state
> bit, rather than just during its transitional moment.  If an
> application calls the kernel at an unexpected time (third-party code
> is the usual culprit for our customers, whether it's syscalls, page
> faults, or other things) we would prefer to have the "quiesce"
> semantics stay in force and cause the third-party code to be
> visibly very slow, rather than cause a totally unexpected and
> hard-to-diagnose interrupt show up later as we are still going
> around the loop that we thought was safely userspace-only.
> 
> And, for debugging the kernel, it's crazy helpful to have that state
> bit in place: see patch 6/6 in the series for how we can diagnose
> things like "a different core just queued an IPI that will hit a
> dataplane core unexpectedly".  Having that state bit makes this sort
> of thing a trivial check in the kernel and relatively easy to debug.

I agree with this!  It is currently a bit painful to debug problems
that might result in multiple tasks runnable on a given CPU.  If you
suspect a problem, you enable tracing and re-run.  Not particularly
friendly for chasing down intermittent problems, so some sort of
improvement would be a very good thing.

							Thanx, Paul

> Finally, I proposed a "strict" mode in patch 5/6 where we kill the
> process if it voluntarily enters the kernel by mistake after saying it
> wasn't going to any more.  To do this requires a state bit, so
> carrying another state bit for "quiesce on user entry" seems pretty
> reasonable.
> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
@ 2015-05-12 13:20       ` Paul E. McKenney
  0 siblings, 0 replies; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-12 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Christoph Lameter, Srivatsa S. Bhat,
	linux-doc, linux-api, linux-kernel

On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> > +++ b/kernel/time/tick-sched.c
> > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> >  			(jiffies - start));
> >  		dump_stack();
> >  	}
> > +
> > +	/*
> > +	 * Kill the process if it violates STRICT mode.  Note that this
> > +	 * code also results in killing the task if a kernel bug causes an
> > +	 * irq to be delivered to this core.
> > +	 */
> > +	if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> > +	    == PR_DATAPLANE_STRICT) {
> > +		pr_warn("Dataplane STRICT mode violated; process killed.\n");
> > +		dump_stack();
> > +		task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> > +		local_irq_enable();
> > +		do_group_exit(SIGKILL);
> > +	}
> >  }
> 
> So while I'm all for hard fails like this, can we not provide a wee bit
> more information in the siginfo ? And maybe use a slightly less fatal
> signal, such that userspace can actually catch it and dump state in
> debug modes?

Agreed, a bit more debug state would be helpful.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 340+ messages in thread


* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full)
  2015-05-12 12:34                           ` Ingo Molnar
  2015-05-12 12:39                               ` Peter Zijlstra
@ 2015-05-12 15:36                             ` Frederic Weisbecker
  1 sibling, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-05-12 15:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Chris Metcalf, Steven Rostedt, Andrew Morton,
	paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote:
> > > 
> > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this 
> > > is a high level kernel feature, so it won't conflict with 
> > > isolation concepts in lower level subsystems such as IOMMU 
> > > isolation - and other higher level features like scheduler 
> > > isolation are basically another partial implementation we want to 
> > > merge with all this...
> > 
> > But why do we need a CONFIG flag for something that has no content?
> > 
> > That is, I do not see anything much, except the 'I want to stay in 
> > userspace and kill me otherwise' flag, and I'm not sure that 
> > warrants a CONFIG flag like this.
> > 
> > Other than that, it's all a combination of NOHZ_FULL and 
> > cpusets/isolcpus and whatnot.
> 
> Yes, that's what I meant: CONFIG_ISOLATION would trigger what is 
> NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as 
> an individual Kconfig option?

Right, we could return to what we had previously: CONFIG_NO_HZ. A config
that enables dynticks-idle by default and allows full dynticks if the nohz_full=
boot option is passed (or something driven by a higher-level isolation interface).

Because eventually, distros enable NO_HZ_FULL so that their 0.0001% users
can use it. Well at least Red Hat does.

> 
> CONFIG_ISOLATION=y would express the guarantee from the kernel that 
> it's possible for user-space to configure itself to run undisturbed - 
> instead of the current inconsistent set of options and facilities.
> 
> A bit like CONFIG_PREEMPT_RT is more than just preemptible spinlocks, 
> it also tries to offer various facilities and tune the defaults to 
> turn the kernel hard-rt.
> 
> Does that make sense to you?

Right, although distros tend to want features to be enabled dynamically
so that they have a single kernel to maintain. Things like PREEMPT_RT
really need to be a different kernel because fundamental primitives like
spinlocks must be implemented statically.

But isolation can be boot-enabled, or even runtime-enabled, as it's only
about timer, irq, and task affinity. Full nohz is more complicated, but it can
be runtime-toggled in the future.

So we can bring in CONFIG_CPU_ISOLATION, at least so that distros that are
really not interested can disable it. CONFIG_CPU_ISOLATION=y would provide
an ability which is default-disabled and driven dynamically through whatever
interface.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: CONFIG_ISOLATION=y
  2015-05-12  9:10                       ` CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) Ingo Molnar
@ 2015-05-12 21:05                           ` Chris Metcalf
  2015-05-12 21:05                           ` CONFIG_ISOLATION=y Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-12 21:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck,
	Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On 05/12/2015 05:10 AM, Ingo Molnar wrote:
> * Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> - ISOLATION (Frederic).  I like this but it conflicts with other uses
>>    of "isolation" in the kernel: cgroup isolation, lru page isolation,
>>    iommu isolation, scheduler isolation (at least it's a superset of
>>    that one), etc.  Also, we're not exactly isolating a task - often
>>    a "dataplane" app consists of a bunch of interacting threads in
>>    userspace, so not exactly isolated.  So perhaps it's too confusing.
> So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is
> a high level kernel feature, so it won't conflict with isolation
> concepts in lower level subsystems such as IOMMU isolation - and other
> higher level features like scheduler isolation are basically another
> partial implementation we want to merge with all this...
>
> nohz, RCU tricks, watchdog defaults, isolcpus and various other
> measures to keep these CPUs and workloads as isolated as possible
> are (or should become) components of this high level concept.
>
> Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost
> zero overhead on normal workloads and on non-isolated CPUs, so that
> Linux distributions can enable it.

Using CONFIG_CPU_ISOLATION to capture all this stuff instead of
making CONFIG_NO_HZ_FULL do it seems plausible for naming.
However, this feels like just bombing the current naming to this
new name, right?  I'd like to argue that this is orthogonal to adding
new isolation functionality into no_hz_full, as my patch series has
been doing.  Perhaps we can defer this to a follow-up patch series?
I'm happy to do the work but I'm not sure we want to bundle all
that churn into the current patch series under consideration.
I can use cpu_isolation_xxx for naming in the current patch series
so we don't have to come back and bomb that later.

> Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step
> needed: just like cpusets, the configuration of isolated CPUs should
> be a completely boot option free exercise that can be dynamically 
> done and undone by the administrator via an intuitive interface.

Eventually isolation can be runtime-enabled, but for now I think
it makes sense to be boot-enabled.  As Frederic suggested, we
can arrange full nohz to be runtime toggled in the future.
I agree that it should be reasonable to compile it in by default.

On 05/12/2015 07:48 AM, Peter Zijlstra wrote:
> But why do we need a CONFIG flag for something that has no content?
>
> That is, I do not see anything much; except the 'I want to stay in
> userspace and kill me otherwise' flag, and I'm not sure that warrants a
> CONFIG flag like this.
>
> Other than that, it's all a combination of NOHZ_FULL and cpusets/isolcpus
> and whatnot.

There are three major pieces here - one is the STRICT piece
that you allude to, but there is also the piece where we quiesce
tasks in the kernel until no timer interrupts are pending, and the
piece that allows easy debugging of stray IRQs etc to isolated cpus.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-11 22:28       ` Andy Lutomirski
@ 2015-05-12 21:06         ` Chris Metcalf
  2015-05-12 22:23           ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-05-12 21:06 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel,
	linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner,
	Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef,
	Linux API

On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
> [add peterz due to perf stuff]
>
> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> Patch 6/6 proposes a mechanism to track down times when the
>> kernel screws up and delivers an IRQ to a userspace-only task.
>> Here, we're just trying to identify the times when an application
>> screws itself up out of cluelessness, and provide a mechanism
>> that allows the developer to easily figure out why and fix it.
>>
>> In particular, /proc/interrupts won't show syscalls or page faults,
>> which are two easy ways applications can screw themselves
>> when they think they're in userspace-only mode.  Also, they don't
>> provide sufficient precision to make it clear what part of the
>> application caused the undesired kernel entry.
> Perf does, though, complete with context.

The perf_event suggestions are interesting, but I think it's plausible
for this to be an alternate way to debug the issues that STRICT
addresses.

>> In this case, killing the task is appropriate, since that's exactly
>> the semantics that have been asked for - it's like on architectures
>> that don't natively support unaligned accesses, but fake it relatively
>> slowly in the kernel, and in development you just say "give me a
>> SIGBUS when that happens" and in production you might say
>> "fix it up and let's try to keep going".
> I think more control is needed.  I also think that, if we go this
> route, we should distinguish syscalls, synchronous non-syscall
> entries, and asynchronous non-syscall entries.  They're quite
> different.

I don't think it's necessary to distinguish the types.  As long as we
have a PC pointing to the instruction that triggered the problem,
we can see if it's a system call instruction, a memory write that
caused a page fault, a trap instruction, etc.  We certainly could
add infrastructure to capture syscall numbers, fault/signal numbers,
etc etc, but I think it's overkill if it adds kernel overhead on
entry/exit.

>> A better implementation, I think, is to put the tests for "you
>> screwed up and synchronously entered the kernel" in
>> the syscall_trace_enter() code, which TIF_NOHZ already
>> gets us into;
> No, not unless you're planning on using that to distinguish syscalls
> from other stuff *and* people think that's justified.

So, the question is how we separate synchronous entries
from IRQs?  At a high level, IRQs are kernel bugs (for cpu-isolated
tasks), and synchronous entries are application bugs.  We'd
like to deliver a signal for the latter, and do some kind of
kernel diagnostics for the former.  So we can't just add the
test in the context tracking code, which doesn't actually know
why we're entering or exiting.

That's why I was thinking that the syscall_trace_enter() and
exception_enter paths were the best choices.  I'm fairly sure
that exception_enter is only done for synchronous traps,
page faults, etc.

Certainly on the tile architecture we include the trap number
in the pt_regs, so it's possible to just examine the pt_regs and
know why you entered or are exiting the kernel, but I don't
think we can rely on that for all architectures.
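
To make that concrete, the check I have in mind would look roughly like
this (a sketch only, not the actual patch; dataplane_strict() is a
hypothetical per-task test):

    /* Called from syscall_trace_enter(), which TIF_NOHZ already
     * routes strict tasks through. */
    static void dataplane_strict_syscall(struct pt_regs *regs)
    {
            int nr = syscall_get_nr(current, regs);

            if (!dataplane_strict(current))         /* hypothetical test */
                    return;
            /* Allow turning the mode off, and allow plain exit. */
            if (nr == __NR_prctl || nr == __NR_exit || nr == __NR_exit_group)
                    return;
            force_sig(SIGKILL, current);            /* or a catchable signal */
    }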

> It's far too easy to just make a tiny change to the entry code.  Add a
> tiny trivial change here, a few lines of asm (that's you, audit!)
> there, some weird written-in-asm scheduling code over here, and you
> end up with the truly awful mess that we currently have.
>
> If it really makes sense for this stuff to go with context tracking,
> then fine, but we should *fix* the context tracking first rather than
> kludging around it.  I already have a prototype patch for the relevant
> part of that.
>
>> there, we can test if the dataplane "strict" bit is
>> set and the syscall is not prctl(), then we generate the error.
>> (We'd exclude exit and exit_group here too, since we don't
>> need to shoot down a task that's just trying to kill itself.)
>> This needs a bit of platform-specific code for each platform,
>> but that doesn't seem like too big a problem.
> I'd rather avoid that, too.  This feature isn't really arch-specific,
> so let's avoid the arch stuff if at all possible.

I'll put out a v2 of my patch that does both the things you
advise against :-) just so we can have a strawman to think
about how to do it better - unless you have a suggestion
offhand as to how we can better differentiate sync and async
entries into the kernel in a platform-independent way.

I could imagine modifying user_exit() and exception_enter()
to pass an identifier into the context system saying why they
were changing contexts, so we could have syscalls, trap
numbers, fault numbers, etc., and some way to query as
to whether they were synchronous or asynchronous, and
build this scheme on top of that, but I'm not sure the extra
infrastructure is worthwhile.
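
Just to sketch the shape of that idea (all names hypothetical):

    enum ctx_entry_reason {
            CTX_SYSCALL,            /* user_exit() from a syscall */
            CTX_TRAP,               /* exception_enter(): traps, page faults */
            CTX_IRQ,                /* asynchronous interrupt */
            CTX_NMI,                /* NMI/MCE-style entries */
    };

    static inline bool ctx_reason_is_sync(enum ctx_entry_reason reason)
    {
            return reason == CTX_SYSCALL || reason == CTX_TRAP;
    }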

>> Likewise we can test in exception_enter() since that's only
>> called for all the synchronous user entries like page faults.
> Let's try to generalize a bit.  There's also irq_entry and ist_enter,
> and some of the exception_enter cases are for synchronous entries
> while (IIRC -- could be wrong) others aren't always like that.

I don't think we need to generalize this piece.  irq_entry()
shouldn't be reported by the STRICT mechanism but by
kernel bug reporting.  For ist_enter(), it looks like if you're
coming from userspace it's just handled with exception_enter().
I'm more familiar with the tile architecture mechanisms than
with x86, though, to be honest.

>>> Also, I think that most users will be quite surprised if "strict
>>> dataplane" code causes any machine check on the system to kill your
>>> dataplane task.
>>
>> Fair point, and avoided by testing as described above instead.
>> (Though presumably in development it's not such a big deal,
>> and as I said you'd likely turn it off in production.)
> Until you forget to turn it off in production because it worked so
> nicely in development.

I guess that's an argument for using a non-fatal signal with a
handler from the get-go, since then even in production you'll
just end up with a slightly heavier-weight kernel overhead
(whatever stupid thing your application did, plus the time
spent in the signal handler), but then after that you can get
back to processing packets or whatever the app is doing.

You had mentioned some alternatives to a catchable signal
(a signal to some other process, or queuing to an fd); I think
it still seems reasonable to just deliver a signal to the process,
configurably by the prctl, and not do anything more complex.
Does this seem reasonable to you at this point?
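
In other words, something like this from the application side (the PR_*
values are placeholders for whatever this series assigns, and the SIGUSR1
choice assumes the configurable-signal variant):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_DATAPLANE
    #define PR_SET_DATAPLANE        47      /* placeholder values */
    #define PR_DATAPLANE_ENABLE     1
    #define PR_DATAPLANE_STRICT     2
    #endif

    static void strict_violation(int sig, siginfo_t *si, void *uc)
    {
            /* Async-signal-safe logging; in development, abort() instead. */
            static const char msg[] = "unexpected kernel entry\n";
            write(STDERR_FILENO, msg, sizeof(msg) - 1);
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = strict_violation;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGUSR1, &sa, NULL);

            prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE | PR_DATAPLANE_STRICT);
            /* ... userspace-only packet-processing loop ... */
            return 0;
    }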

> What if we added a mode to perf where delivery of a sample
> synchronously (or semi-synchronously by catching it on the next exit
> to userspace) freezes the delivering task?  It would be like debugger
> support via perf.
>
> peterz, do you think this would be a sensible thing to add to perf?
> It would only make sense for some types of events (tracepoints and
> hw_breakpoints mostly, I think).

I suspect it's reasonable to consider this orthogonal, particularly
if there is some skid between the actual violation by the
application, and the freeze happening.

You pushed back somewhat on prctl() in favor of a quiesce()
syscall in your email, but it seemed like at the end of your
email you were adopting the prctl() perspective.  Is that true?
I admit the prctl() still seems cleaner from my perspective.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-12 21:06         ` Chris Metcalf
@ 2015-05-12 22:23           ` Andy Lutomirski
  2015-05-15 21:25             ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-12 22:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel,
	Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner,
	Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc,
	Christoph Lameter, Gilad Ben Yossef, Ingo Molnar

On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>
> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>
>> [add peterz due to perf stuff]
>>
>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>>
>>> Patch 6/6 proposes a mechanism to track down times when the
>>> kernel screws up and delivers an IRQ to a userspace-only task.
>>> Here, we're just trying to identify the times when an application
>>> screws itself up out of cluelessness, and provide a mechanism
>>> that allows the developer to easily figure out why and fix it.
>>>
>>> In particular, /proc/interrupts won't show syscalls or page faults,
>>> which are two easy ways applications can screw themselves
>>> when they think they're in userspace-only mode.  Also, they don't
>>> provide sufficient precision to make it clear what part of the
>>> application caused the undesired kernel entry.
>>
>> Perf does, though, complete with context.
>
>
> The perf_event suggestions are interesting, but I think it's plausible
> for this to be an alternate way to debug the issues that STRICT
> addresses.
>
>
>>> In this case, killing the task is appropriate, since that's exactly
>>> the semantics that have been asked for - it's like on architectures
>>> that don't natively support unaligned accesses, but fake it relatively
>>> slowly in the kernel, and in development you just say "give me a
>>> SIGBUS when that happens" and in production you might say
>>> "fix it up and let's try to keep going".
>>
>> I think more control is needed.  I also think that, if we go this
>> route, we should distinguish syscalls, synchronous non-syscall
>> entries, and asynchronous non-syscall entries.  They're quite
>> different.
>
>
> I don't think it's necessary to distinguish the types.  As long as we
> have a PC pointing to the instruction that triggered the problem,
> we can see if it's a system call instruction, a memory write that
> caused a page fault, a trap instruction, etc.

Not true.  PC right after a syscall insn could be any type of kernel
entry, and you can't even reliably tell whether the syscall insn was
executed or, on x86, whether it was a syscall at all.  (x86 insns
can't be reliably decoded backwards.)

PC pointing at a load could be a page fault or an IPI.

> We certainly could
> add infrastructure to capture syscall numbers, fault/signal numbers,
> etc etc, but I think it's overkill if it adds kernel overhead on
> entry/exit.
>

None of these should add overhead.

>
>>> A better implementation, I think, is to put the tests for "you
>>> screwed up and synchronously entered the kernel" in
>>> the syscall_trace_enter() code, which TIF_NOHZ already
>>> gets us into;
>>
>> No, not unless you're planning on using that to distinguish syscalls
>> from other stuff *and* people think that's justified.
>
>
> So, the question is how we separate synchronous entries
> from IRQs?  At a high level, IRQs are kernel bugs (for cpu-isolated
> tasks), and synchronous entries are application bugs.  We'd
> like to deliver a signal for the latter, and do some kind of
> kernel diagnostics for the former.  So we can't just add the
> test in the context tracking code, which doesn't actually know
> why we're entering or exiting.

Synchronous entries could be VM bugs, too.

>
> That's why I was thinking that the syscall_trace_entry and
> exception_enter paths were the best choices.  I'm fairly sure
> that exception_enter is only done for synchronous traps,
> page faults, etc.

Maybe.  Doing it through the actual entry/exit slow paths would be
overhead-free, although I'm not sure that IRQs have real slow paths
for entry.

>
> Certainly on the tile architecture we include the trap number
> in the pt_regs, so it's possible to just examine the pt_regs and
> know why you entered or are exiting the kernel, but I don't
> think we can rely on that for all architectures.

x86 can't do this.

> I'll put out a v2 of my patch that does both the things you
> advise against :-) just so we can have a strawman to think
> about how to do it better - unless you have a suggestion
> offhand as to how we can better differentiate sync and async
> entries into the kernel in a platform-independent way.
>
> I could imagine modifying user_exit() and exception_enter()
> to pass an identifier into the context system saying why they
> were changing contexts, so we could have syscalls, trap
> numbers, fault numbers, etc., and some way to query as
> to whether they were synchronous or asynchronous, and
> build this scheme on top of that, but I'm not sure the extra
> infrastructure is worthwhile.
>

I'll take a look.

Again, though, I think we really do need to distinguish at least MCE
and NMI (on x86) from the others.

>
>> What if we added a mode to perf where delivery of a sample
>> synchronously (or semi-synchronously by catching it on the next exit
>> to userspace) freezes the delivering task?  It would be like debugger
>> support via perf.
>>
>> peterz, do you think this would be a sensible thing to add to perf?
>> It would only make sense for some types of events (tracepoints and
>> hw_breakpoints mostly, I think).
>
>
> I suspect it's reasonable to consider this orthogonal, particularly
> if there is some skid between the actual violation by the
> application, and the freeze happening.
>

I think it could be done without skid, except for async entries, but
for async entries we don't care about exact user state anyway.

> You pushed back somewhat on prctl() in favor of a quiesce()
> syscall in your email, but it seemed like at the end of your
> email you were adopting the prctl() perspective.  Is that true?
> I admit the prctl() still seems cleaner from my perspective.
>

Prctl for the strict thing seems much more reasonable to me than prctl
for quiescing.  Also, the scheduler people seem to think that
quiescing should be automatic.

Anyway, I'll happily look at code and maybe even write more coherent
emails when I'm back in town in a week.  Since you're thinking that
async entries should give kernel diagnostics instead of signals, maybe
the right thing to do is to separate them out completely and try to
address the individual entry types separately and as needed.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-12 12:52         ` Ingo Molnar
@ 2015-05-13  4:35           ` Andy Lutomirski
  2015-05-13 17:51             ` Paul E. McKenney
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-05-13  4:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney,
	Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API,
	linux-kernel

On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > So if then a prctl() (or other system call) could be a shortcut
>> > to:
>> >
>> >  - move the task to an isolated CPU
>> >  - make sure there _is_ such an isolated domain available
>> >
>> > I.e. have some programmatic, kernel provided way for an
>> > application to be sure it's running in the right environment.
>> > Relying on random administration flags here and there won't cut
>> > it.
>>
>> No, we already have sched_setaffinity() and we should not duplicate
>> its ability to move tasks about.
>
> But sched_setaffinity() does not guarantee isolation - it's just a
> syscall to move a task to a set of CPUs, which might be isolated or
> not.
>
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
>         ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.
>
> ( BIND_ISOLATED will not replace the current p->policy value, so it's
>   still possible to use the regular policies as well on top of this. )

I think we shouldn't have magic selection of an isolated domain.
Anyone using this has already configured some isolated CPUs and
probably wants to choose the CPU and, especially, NUMA node
themselves.  Also, maybe it should be a special type of realtime
class/priority -- doing this should require RT permission IMO.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-13  4:35           ` Andy Lutomirski
@ 2015-05-13 17:51             ` Paul E. McKenney
  2015-05-14 20:55                 ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Paul E. McKenney @ 2015-05-13 17:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Frederic Weisbecker,
	Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API,
	linux-kernel

On Tue, May 12, 2015 at 09:35:25PM -0700, Andy Lutomirski wrote:
> On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> > So if then a prctl() (or other system call) could be a shortcut
> >> > to:
> >> >
> >> >  - move the task to an isolated CPU
> >> >  - make sure there _is_ such an isolated domain available
> >> >
> >> > I.e. have some programmatic, kernel provided way for an
> >> > application to be sure it's running in the right environment.
> >> > Relying on random administration flags here and there won't cut
> >> > it.
> >>
> >> No, we already have sched_setaffinity() and we should not duplicate
> >> its ability to move tasks about.
> >
> > But sched_setaffinity() does not guarantee isolation - it's just a
> > syscall to move a task to a set of CPUs, which might be isolated or
> > not.
> >
> > What I suggested is that it might make sense to offer a system call,
> > for example a sched_setparam() variant, that makes such guarantees.
> >
> > Say if user-space does:
> >
> >         ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
> >
> > ... then we would get the task moved to an isolated domain and get a 0
> > return code if the kernel is able to do all that and if the current
> > uid/namespace/etc. has the required permissions and such.
> >
> > ( BIND_ISOLATED will not replace the current p->policy value, so it's
> >   still possible to use the regular policies as well on top of this. )
> 
> I think we shouldn't have magic selection of an isolated domain.
> Anyone using this has already configured some isolated CPUs and
> probably wants to choose the CPU and, especially, NUMA node
> themselves.  Also, maybe it should be a special type of realtime
> class/priority -- doing this should require RT permission IMO.

I have no real argument against special permissions, but this feature
is totally orthogonal to realtime classes/priorities.  It is perfectly
legitimate for a given CPU's single runnable task to be SCHED_OTHER,
for example.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-12  9:33     ` Peter Zijlstra
@ 2015-05-14 20:54       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-14 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc,
	linux-api, linux-kernel

On 05/12/2015 05:33 AM, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote:
>> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
>> kernel to quiesce any pending timer interrupts prior to returning
>> to userspace.  When running with this mode set, sys calls (and page
>> faults, etc.) can be inordinately slow.  However, user applications
>> that want to guarantee that no unexpected interrupts will occur
>> (even if they call into the kernel) can set this flag to guarantee
>> that semantics.
> Currently people hot-unplug and hot-plug the CPU to do this. Obviously
> that's a wee bit horrible :-)
>
> Not sure if a prctl like this is any better though. This is a CPU
> property, not a process one.

The CPU property aspects, I think, should be largely handled by
fixing kernel bugs that let work end up running on nohz_full cores
without having been explicitly requested to run there.

As you said in a follow-up email:

On 05/12/2015 06:38 AM, Peter Zijlstra wrote:
> Ideally we'd never have to clear the state because it should be
> impossible to get into this predicament in the first place.

What my prctl() proposal does is quiesce things that end up
happening specifically because the user process deliberately called
into the kernel.  For example, perhaps RCU was invoked in the
kernel, and the core has to wait a timer tick to quiesce RCU.
Whatever causes it, the intent is that you're not allowed back into
userspace until everything has settled down from your call into
the kernel; the presumption is that it's all due to the kernel entry
that was just made, and not from other stray work.

In that sense, it's very appropriate for it to be a process property.
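
The shape of that quiesce step, as a sketch rather than the actual patch,
is a loop on the exit-to-user path that waits until the local tick device
has nothing queued:

    /* Sketch: spin on return to userspace until no timer event is
     * pending on this cpu.  Error/fallback handling omitted. */
    static void dataplane_quiesce(void)
    {
            struct clock_event_device *dev =
                    __this_cpu_read(tick_cpu_device.evtdev);

            while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
                    local_irq_enable();     /* let the pending timer fire */
                    cpu_relax();
                    local_irq_disable();
            }
    }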

> ISTR people talking about 'quiesce' sysfs file, along side the hotplug
> stuff, I can't quite remember.

It seems somewhat similar (adding Viresh to the cc's) but does
seem like it might have been more intended to address the
CPU properties rather than process properties:

   https://lkml.org/lkml/2014/4/4/99

One thing the original Tilera dataplane code did was to allow
setting the dataplane flags only on dataplane cores, and only when
the task had been affinitized to that single core.
This did not protect the task from later being re-affinitized in
a way that broke those assumptions, but I suppose you could
also imagine making sched_setaffinity() fail for such a process.
Somewhat unrelated, but it occurred to me in the context of this
reply, so what do you think?  I can certainly add this to the
patch series if it seems like it makes setting the prctl() flags
more conservative.
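
As a sketch, the conservative check could look like this at prctl() time
(the helper name is hypothetical):

    /* Refuse the dataplane flags unless the task is pinned to a
     * single nohz_full cpu. */
    static int dataplane_check_affinity(struct task_struct *p)
    {
            if (cpumask_weight(tsk_cpus_allowed(p)) != 1)
                    return -EINVAL;         /* must be pinned to one cpu */
            if (!tick_nohz_full_cpu(cpumask_first(tsk_cpus_allowed(p))))
                    return -EINVAL;         /* and it must be nohz_full */
            return 0;
    }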

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
@ 2015-05-14 20:55                 ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-14 20:55 UTC (permalink / raw)
  To: paulmck, Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Frederic Weisbecker, Christoph Lameter,
	linux-doc, Linux API, linux-kernel

On 05/12/2015 08:52 AM, Ingo Molnar wrote:
> What I suggested is that it might make sense to offer a system call,
> for example a sched_setparam() variant, that makes such guarantees.
>
> Say if user-space does:
>
> 	ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params);
>
> ... then we would get the task moved to an isolated domain and get a 0
> return code if the kernel is able to do all that and if the current
> uid/namespace/etc. has the required permissions and such.

Unfortunately I don't know nearly as much about the scheduler
and scheduler policies as I might, since I mostly focused on
making the scheduler stay out of the way.  :-)  This does seem like
another way to set a policy bit on a process.  I assume you
could only validly issue this call on a nohz_full core, and that
you're not assuming it migrates the task to such a core?

You suggested that BIND_ISOLATED would not replace the usual
scheduler policies, but perhaps SCHED_ISOLATED as a full
replacement would make sense - it would make it an error
to have any other schedulable task on that core.  I guess that
brings it around to whether the "cpu_isolated" task just loses when
another task is scheduled on the core with it (the current
approach I'm proposing) or if it ends up truly owning the core
and other processes can be denied the right to run there:
which in that case clearly does get us into the area of requiring
privileges to set up, as Andy pointed out later.

This would leave the notion of "strict" as proposed elsewhere
as a separate thing, but presumably it could still be a prctl()
as originally proposed.

I admit I don't know enough to say whether this sounds like
a better approach than just using a prctl() to set the
cpu_isolated state.  My instinct is that it's cleanest to avoid
requiring permissions to do this, and to simply enable the
quiescing semantics the process requested when it happens
to be alone on a core.  If so, it's somewhat orthogonal to the
actual scheduler policy in force, so best not to conflate it with
the notion of scheduler code at all via sched_setscheduler()?
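
For what it's worth, the userspace side of the SCHED_ISOLATED idea would
presumably look like this (SCHED_ISOLATED does not exist; the value here
is made up):

    #include <sched.h>
    #include <stdio.h>

    #define SCHED_ISOLATED  7       /* hypothetical policy */

    int main(void)
    {
            struct sched_param sp = { .sched_priority = 0 };

            /* Fails unless we're pinned to a suitable isolated core. */
            if (sched_setscheduler(0, SCHED_ISOLATED, &sp)) {
                    perror("sched_setscheduler");
                    return 1;
            }
            return 0;
    }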

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  2015-05-12 13:12     ` Paul E. McKenney
@ 2015-05-14 20:55       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-14 20:55 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Christoph Lameter, linux-kernel

On 05/12/2015 09:12 AM, Paul E. McKenney wrote:
> On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
>> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
>>> While the current fallback to 1-second tick is still helpful for
>>> maintaining completely correct kernel semantics, processes using
>>> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
>>> completely tickless, so don't bound the time_delta for such processes.
>>>
>>> This was previously discussed in
>>>
>>> https://lkml.org/lkml/2014/10/31/364
>>>
>>> and Thomas Gleixner observed that vruntime, load balancing data,
>>> load accounting, and other things might be impacted.  Frederic
>>> Weisbecker similarly observed that allowing the tick to be indefinitely
>>> deferred just meant that no one would ever fix the underlying bugs.
>>> However it's at least true that the mode proposed in this patch can
>>> only be enabled on an isolcpus core, which may limit how important
>>> it is to maintain scheduler data correctly, for example.
>> So how is making this available going to help people fix the actual
>> problem?
> It will at least provide an environment where adding more of this
> problem might get punished.  This would be an improvement over what
> we have today, namely that the 1HZ fallback timer silently forgives
> adding more problems of this sort.

So I guess the obvious question to ask is whether there is a mode
that can be dynamically enabled (/proc/sys/kernel/nohz_experimental
or whatever) where we allow turning off this tick - perhaps to make
it more likely that tick-dependent code isn't added to the kernel, as Paul
suggests, or perhaps to enable applications that want to avoid the
conservative fallback tick and are willing to do enough QA that they
are comfortable running with the 1Hz tick disabled?

Paul, PeterZ, any thoughts on something along these lines?
Or another suggestion?
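
Wiring such a knob up would be trivial; a sketch (name and placement are
only illustrative):

    /* Sketch of a /proc/sys/kernel/nohz_experimental knob; when set,
     * the tick code would skip bounding time_delta to 1 second. */
    static int sysctl_nohz_experimental;
    static int zero, one = 1;

    static struct ctl_table nohz_experimental_table[] = {
            {
                    .procname       = "nohz_experimental",
                    .data           = &sysctl_nohz_experimental,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = &zero,
                    .extra2         = &one,
            },
            { }
    };

    static int __init nohz_experimental_init(void)
    {
            register_sysctl("kernel", nohz_experimental_table);
            return 0;
    }
    late_initcall(nohz_experimental_init);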

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-12  1:47                   ` Mike Galbraith
  2015-05-12  4:35                     ` Mike Galbraith
@ 2015-05-15 15:05                     ` Chris Metcalf
  2015-05-15 18:44                       ` Mike Galbraith
  1 sibling, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 15:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	linux-kernel

On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
>> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
>>> I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
>>> that old static isolcpus was_supposed_  to crawl off and die, I know
>>> beyond doubt that having isolated a cpu as well as you can definitely
>>> does NOT imply that said cpu should become tickless.
>> True, at a high level, I agree that it would be better to have a
>> top-level concept like Frederic's proposed ISOLATION that includes
>> isolcpus and nohz_cpu (and other stuff as needed).
>>
>> That said, what you wrote above is wrong; even with the patch you
>> acked, setting isolcpus does not automatically turn on nohz_full for
>> a given cpu.  The patch made it true the other way around: when
>> you say nohz_full, you automatically get isolcpus on that cpu too.
>> That does, at least, make sense for the semantics of nohz_full.
> I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> Yes, with nohz_full currently being static, the old allegedly dying but
> also static isolcpus scheduler off switch is a convenient thing to wire
> the nohz_full CPU SET (<- hint;) property to.

Yes, I was responding to the bit where you said "having isolated a
cpu as well as you can does NOT imply it should become tickless",
but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
In any case sounds like we were just talking past each other.

> BTW, another facet of this: Rik wants to make isolcpus immune to
> cpusets, which makes some sense, user did say isolcpus=, but that also
> makes isolcpus truly static.  If the user now says nohz_full=, they lose
> the ability to deactivate CPU isolation, making the set fairly useless
> for anything other than HPC.  Currently, the user can flip the isolation
> switch as he sees fit.  He takes a size extra large performance hit for
> having said nohz_full=, but he doesn't lose generic utility.

I don't think I follow this completely.  If the user says nohz_full=, he
probably doesn't care about deactivating isolcpus later, since that
defeats the entire purpose of the nohz_full= in the first place,
as far as I can tell.  And when you say "anything other than HPC",
I'm not sure what you mean; as far as I know high-performance
computing only cares because it wants that extra 0.5% of the
cpu or whatever interrupts eat up, but just as a nice-to-have.
The real use case is high-performance userspace drivers where
the nohz_full cores are responding to real-time things like packet
arrivals with almost no latency to spare.

What is the generic utility you're envisioning for nohz_full cores
that have turned off scheduler isolation?  I assume it's some
workload where you'd prefer not to have too many interrupts
but still are running multiple tasks, but in that case does it really
make much difference in practice?

> Thomas has nuked the hrtimer softirq.

Yes, this I didn't know.  So I will drop my "no ksoftirqd" patch and
we will see if ksoftirqs emerge as an issue for my "cpu isolation"
stuff in the future; it may be that that was the only issue.

> Inlining softirqs may save a context switch, but adds cycles that we may
> consume at higher frequency than the thing we're avoiding.

Yes but consuming cycles is not nearly as much of a concern
as avoiding interrupts or scheduling, certainly for the case of
userspace drivers that I described above.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-12 10:46               ` Peter Zijlstra
@ 2015-05-15 15:10                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 15:10 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc,
	linux-api, linux-kernel

On 05/12/2015 06:46 AM, Peter Zijlstra wrote:
> On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote:
>> Please lets get NO_HZ_FULL up to par. That should be the main focus.
>>
> ACK, much of this dataplane stuff is (useful) hacks working around the
> fact that nohz_full just isn't complete.

There are enough disjoint threads on this topic that I want
to just touch base here and see if you have been convinced
on other threads that there is stuff beyond the hacks here:
in particular

1. The basic "dataplane" mode to arrange to do extra work on
return to kernel space that normally isn't warranted, to avoid
future IPIs, and additionally to wait in the kernel until any timer
interrupts required by the kernel invocation itself are done; and

2. The "strict" mode to allow a task to tell the kernel it isn't
planning on making any more such calls, and have the kernel
help diagnose any resulting application bugs.
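
For reference, the interface those two pieces hang off of, roughly as the
series shapes it (the numeric values here are illustrative):

    #define PR_SET_DATAPLANE        47              /* illustrative value */
    #define PR_GET_DATAPLANE        48
    # define PR_DATAPLANE_ENABLE    (1 << 0)        /* opt in to isolation */
    # define PR_DATAPLANE_QUIESCE   (1 << 1)        /* wait out pending timers */
    # define PR_DATAPLANE_STRICT    (1 << 2)        /* no further kernel entries */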

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-15 15:05                     ` Chris Metcalf
@ 2015-05-15 18:44                       ` Mike Galbraith
  2015-05-26 19:51                         ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Mike Galbraith @ 2015-05-15 18:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	linux-kernel

On Fri, 2015-05-15 at 11:05 -0400, Chris Metcalf wrote:
> On 05/11/2015 09:47 PM, Mike Galbraith wrote:
> > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote:
> >> On 05/11/2015 03:19 PM, Mike Galbraith wrote:
> >>> I really shouldn't have acked nohz_full -> isolcpus.  Beside the fact
> >>> that old static isolcpus was _supposed_ to crawl off and die, I know
> >>> beyond doubt that having isolated a cpu as well as you can definitely
> >>> does NOT imply that said cpu should become tickless.
> >> True, at a high level, I agree that it would be better to have a
> >> top-level concept like Frederic's proposed ISOLATION that includes
> >> isolcpus and nohz_cpu (and other stuff as needed).
> >>
> >> That said, what you wrote above is wrong; even with the patch you
> >> acked, setting isolcpus does not automatically turn on nohz_full for
> >> a given cpu.  The patch made it true the other way around: when
> >> you say nohz_full, you automatically get isolcpus on that cpu too.
> >> That does, at least, make sense for the semantics of nohz_full.
> > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus.
> > Yes, with nohz_full currently being static, the old allegedly dying but
> > also static isolcpus scheduler off switch is a convenient thing to wire
> > the nohz_full CPU SET (<- hint;) property to.
> 
> Yes, I was responding to the bit where you said "having isolated a
> cpu as well as you can does NOT imply it should become tickless",
> but indeed, the "nohz_full -> isolcpus" patch didn't make that true.
> In any case sounds like we were just talking past each other.

Yup.

> > BTW, another facet of this: Rik wants to make isolcpus immune to
> > cpusets, which makes some sense, user did say isolcpus=, but that also
> > makes isolcpus truly static.  If the user now says nohz_full=, they lose
> > the ability to deactivate CPU isolation, making the set fairly useless
> > for anything other than HPC.  Currently, the user can flip the isolation
> > switch as he sees fit.  He takes a size extra large performance hit for
> > having said nohz_full=, but he doesn't lose generic utility.
> 
> I don't think I follow this completely.  If the user says nohz_full=, he
> probably doesn't care about deactivating isolcpus later, since that
> defeats the entire purpose of the nohz_full= in the first place,
> as far as I can tell.  And when you say "anything other than HPC",
> I'm not sure what you mean; as far as I know high-performance
> computing only cares because it wants that extra 0.5% of the
> cpu or whatever interrupts eat up, but just as a nice-to-have.
> The real use case is high-performance userspace drivers where
> the nohz_full cores are responding to real-time things like packet
> arrivals with almost no latency to spare.

Ok, verbosity on.

Currently, nohz_full is static, meaning in a dynamic environment, where
the user may not have a constant need for it, if you make it imply
isolcpus, then make isolcpus immutable, you have just needlessly taken
an option from the user.  Those CPUS are no longer part of his generic
resource pool, and he has nothing to say about it.

> What is the generic utility you're envisioning for nohz_full cores
> that have turned off scheduler isolation?  I assume it's some
> workload where you'd prefer not to have too many interrupts
> but still are running multiple tasks, but in that case does it really
> make much difference in practice?

Again, I think we're talking past one another.

I'm saying there is no need to mandate, nothing more.  For your needs,
my needs whatever, that immutable may sound good, but in fact, it
removes flexibility, and for no good reason.

This shows immediately in simple testing.  Do I need nohz_full?  Hell
no, only for testing.  If I want to test, I obviously need it for a
while, and yes, I can reboot... but what's the difference between me the
silly tester who needs it only to see if it works at all, and how well,
and some guy who does something critical once in a while, or a company
with a pool of big boxen that they reconfigure on the fly to meet
whatever dynamic needs?

Just because the nohz_full feature itself is currently static is no
reason to put users thereof in a straitjacket by mandating that any
set they define irrevocably disappears from the generic resource pool.
Those CPUs are useful until the moment someone cripples them.
Making nohz_full imply isolcpus sounds perfectly fine until someone
comes along and makes isolcpus immutable (Rik's patch), at which point
the user loses a choice due to two changes that _alone_ sound
perfectly fine.

See what I'm saying now?

> > Thomas has nuked the hrtimer softirq.
> 
> Yes, this I didn't know.  So I will drop my "no ksoftirqd" patch and
> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
> stuff in the future; it may be that that was the only issue.
> 
> > Inlining softirqs may save a context switch, but adds cycles that we may
> > consume at higher frequency than the thing we're avoiding.
> 
> Yes but consuming cycles is not nearly as much of a concern
> as avoiding interrupts or scheduling, certainly for the case of
> userspace drivers that I described above.

If you're raising softirqs in an SMP kernel, you're also doing something
that puts you at very serious risk of meeting the jitter monster, locks,
and worse, sleeping locks, no?

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
  2015-05-12 22:23           ` Andy Lutomirski
@ 2015-05-15 21:25             ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel,
	Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner,
	Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc,
	Christoph Lameter, Gilad Ben Yossef, Ingo Molnar

On 05/12/2015 06:23 PM, Andy Lutomirski wrote:
> On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote:
>> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>>> In this case, killing the task is appropriate, since that's exactly
>>>> the semantics that have been asked for - it's like on architectures
>>>> that don't natively support unaligned accesses, but fake it relatively
>>>> slowly in the kernel, and in development you just say "give me a
>>>> SIGBUS when that happens" and in production you might say
>>>> "fix it up and let's try to keep going".
>>> I think more control is needed.  I also think that, if we go this
>>> route, we should distinguish syscalls, synchronous non-syscall
>>> entries, and asynchronous non-syscall entries.  They're quite
>>> different.
>>
>> I don't think it's necessary to distinguish the types.  As long as we
>> have a PC pointing to the instruction that triggered the problem,
>> we can see if it's a system call instruction, a memory write that
>> caused a page fault, a trap instruction, etc.
> Not true.  PC right after a syscall insn could be any type of kernel
> entry, and you can't even reliably tell whether the syscall insn was
> executed or, on x86, whether it was a syscall at all.  (x86 insns
> can't be reliably decoded backwards.)
>
> PC pointing at a load could be a page fault or an IPI.

All that we are trying to do with this API, though, is distinguish
synchronous faults.  So IPIs, etc., should not be happening
(they would be bugs), and hopefully we are mostly just
distinguishing different types of synchronous program entries.
That said, I did add an si_info flag to differentiate syscalls from other
synchronous entries, and I'm open to looking at more such flags if
it seems useful.
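
So a handler could do a coarse split like the following (the si_code
values are whatever the series ends up defining; the one here is made up):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void strict_handler(int sig, siginfo_t *si, void *uc)
    {
            /* si_code == 1 meaning "syscall" is hypothetical. */
            const char *msg = (si->si_code == 1)
                    ? "strict: stray syscall\n"
                    : "strict: other synchronous entry\n";
            write(STDERR_FILENO, msg, strlen(msg));
    }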

> Again, though, I think we really do need to distinguish at least MCE 
> and NMI (on x86) from the others. 

Yes, those are both interesting cases, and I'm not entirely
sure what the right way to handle them is - for example,
likely disable STRICT if you are running with perf enabled.

I look forward to hearing more when you're back next week!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v2 0/5] support "cpu_isolated" mode for nohz_full
@ 2015-05-15 21:26   ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:26 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested these stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.
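
As a rough sketch of the intended usage (illustrative only, using the
constant values proposed later in this series), a task pinned to a
nohz_full core might do:

	#include <sys/prctl.h>

	#define PR_SET_CPU_ISOLATED	47	/* values proposed here */
	#define PR_GET_CPU_ISOLATED	48
	#define PR_CPU_ISOLATED_ENABLE	(1 << 0)

	/* Request the stricter semantics for this task. */
	if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0) < 0)
		perror("prctl(PR_SET_CPU_ISOLATED)");

	/* ... userspace fast path, undisturbed by the kernel ... */

	/* Query the current flags. */
	int flags = prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0);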

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

Note: I have not yet removed the hack to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555).

Chris Metcalf (5):
  nohz_full: add support for "cpu_isolated" mode
  nohz: support PR_CPU_ISOLATED_STRICT mode
  nohz: cpu_isolated strict mode configurable signal
  nohz: add cpu_isolated_debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

 Documentation/kernel-parameters.txt |  6 +++
 arch/tile/kernel/ptrace.c           |  6 ++-
 arch/tile/mm/homecache.c            |  5 +-
 arch/x86/kernel/ptrace.c            |  2 +
 include/linux/context_tracking.h    | 11 +++--
 include/linux/sched.h               |  3 ++
 include/linux/tick.h                | 28 +++++++++++
 include/uapi/linux/prctl.h          |  8 +++
 kernel/context_tracking.c           | 12 +++--
 kernel/irq_work.c                   |  4 +-
 kernel/sched/core.c                 | 18 +++++++
 kernel/signal.c                     |  5 ++
 kernel/smp.c                        |  4 ++
 kernel/softirq.c                    |  6 +++
 kernel/sys.c                        |  8 +++
 kernel/time/tick-sched.c            | 98 ++++++++++++++++++++++++++++++++++++-
 16 files changed, 214 insertions(+), 10 deletions(-)

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-05-15 21:26   ` Chris Metcalf
@ 2015-05-15 21:27     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl().  When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken.  First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.  Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, syscalls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 +++++++++
 include/uapi/linux/prctl.h |  5 +++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 ++++++++
 kernel/time/tick-sched.c   | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int	cpu_isolated_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
 	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (tick_nohz_is_cpu_isolated())
+					tick_nohz_cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f1551c946c45 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>
 
 #include <asm/irq_regs.h>
 
@@ -389,6 +390,56 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start));
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+
+		/* Idle with interrupts enabled and wait for the tick. */
+		set_current_state(TASK_INTERRUPTIBLE);
+		arch_cpu_idle();
+		set_current_state(TASK_RUNNING);
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start));
+		dump_stack();
+	}
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
  2015-05-15 21:27     ` Chris Metcalf
@ 2015-05-15 21:27       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is ignored to allow clearing the bit again later,
and exit/exit_group are ignored to allow exiting the task without
a pointless signal killing you as you try to do so.
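
As an illustrative sketch of that exemption (error checking omitted):
since prctl() itself is ignored, a task can drop out of STRICT mode
when it knows it needs kernel services, and re-arm it afterwards:

	/* Any unexpected kernel entry is now fatal. */
	prctl(PR_SET_CPU_ISOLATED,
	      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);

	/* ... pure userspace fast path ... */

	/* Downgrade via the exempt prctl(), do the syscall, re-arm. */
	prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);
	write(STDOUT_FILENO, "checkpoint\n", 11);
	prctl(PR_SET_CPU_ISOLATED,
	      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);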

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/ptrace.c        |  6 +++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/tick.h             | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(
+				regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/tick.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (tick_nohz_cpu_isolated_strict())
+				tick_nohz_cpu_isolated_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
 		__tick_nohz_task_switch(tsk);
 }
 
+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->cpu_isolated_flags &
+	     (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+	    (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_CPU_ISOLATED	47
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+# define PR_CPU_ISOLATED_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
 	}
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1551c946c45..273820cd484a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
 #include <linux/swap.h>
 
 #include <asm/irq_regs.h>
+#include <asm/unistd.h>
 
 #include "tick-internal.h"
 
@@ -440,6 +441,43 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
+static void kill_cpu_isolated_strict_task(void)
+{
+	dump_stack();
+	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_cpu_isolated_strict_task();
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal
@ 2015-05-15 21:27       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
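
A sketch of the resulting userspace usage (illustrative only; the
usual <signal.h> and <sys/prctl.h> boilerplate is assumed):

	static volatile sig_atomic_t was_syscall;

	static void violation(int sig, siginfo_t *info, void *uc)
	{
		/* si_code is 1 for a syscall violation, 0 for an exception. */
		was_syscall = info->si_code;
	}

	/* Deliver SIGUSR1 instead of the default SIGKILL on a violation. */
	struct sigaction sa = {
		.sa_sigaction	= violation,
		.sa_flags	= SA_SIGINFO,
	};
	sigaction(SIGUSR1, &sa, NULL);
	prctl(PR_SET_CPU_ISOLATED,
	      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
	      PR_CPU_ISOLATED_SET_SIG(SIGUSR1), 0, 0, 0);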

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/time/tick-sched.c   | 15 +++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
 # define PR_CPU_ISOLATED_STRICT	(1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig)  (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 273820cd484a..772be78f926c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -441,11 +441,18 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	dump_stack();
 	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -464,7 +471,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)
 
 	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(1);
 }
 
 /*
@@ -475,7 +482,7 @@ void tick_nohz_cpu_isolated_exception(void)
 {
 	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(0);
 }
 
 #endif
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v2 4/5] nohz: add cpu_isolated_debug boot flag
  2015-05-15 21:27     ` Chris Metcalf
                       ` (2 preceding siblings ...)
  (?)
@ 2015-05-15 21:27     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-kernel
  Cc: Chris Metcalf

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode.  Such processes should
get no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.
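
For example (assuming cpus 1-3 are the ones dedicated to isolated
tasks), the kernel command line might include:

	nohz_full=1-3 isolcpus=1-3 cpu_isolated_debug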

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel.  But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code which remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/tick.h                |  2 ++
 kernel/irq_work.c                   |  4 +++-
 kernel/sched/core.c                 | 18 ++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  6 ++++++
 8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.txt.
 
+	cpu_isolated_debug	[KNL]
+			In kernels built with CONFIG_NO_HZ_FULL and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_CPU_ISOLATED_ENABLE.
+
 	cpuidle.off=1	[CPU_IDLE]
 			disable the cpuidle sub-system
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/tick.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		tick_nohz_cpu_isolated_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
 extern void tick_nohz_cpu_isolated_syscall(int nr);
 extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
 static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
 static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		tick_nohz_cpu_isolated_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)
 
 	return true;
 }
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+	cpu_isolated_debug = true;
+	return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+	if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+		pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+		dump_stack();
+	}
+}
 #endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_NO_HZ_FULL
+	/* If the task is being killed, don't complain about cpu_isolated. */
+	if (state & TASK_WAKEKILL)
+		t->cpu_isolated_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/tick.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	tick_nohz_cpu_isolated_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		tick_nohz_cpu_isolated_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
 
@@ -335,6 +336,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		tick_nohz_cpu_isolated_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v2 5/5] nohz: cpu_isolated: allow tick to be fully disabled
  2015-05-15 21:27     ` Chris Metcalf
                       ` (3 preceding siblings ...)
  (?)
@ 2015-05-15 21:27     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-kernel
  Cc: Chris Metcalf

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data,
load accounting, and other things might be impacted.  Frederic
Weisbecker similarly observed that allowing the tick to be indefinitely
deferred just meant that no one would ever fix the underlying bugs.
However, it's at least true that the mode proposed in this patch can
only be enabled on an isolcpus core, which may limit how important
it is to maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2005) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.).  So these semantics are very
useful if we can convince ourselves that doing this is safe.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
Note: I have kept this in the series despite PeterZ's nack, since it
didn't seem resolved in the original thread from v1 of the patch
(https://lkml.org/lkml/2015/5/8/555).

 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 772be78f926c..be4db5d81ada 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -727,7 +727,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		}
 
 #ifdef CONFIG_NO_HZ_FULL
-		if (!ts->inidle) {
+		if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
 			time_delta = min(time_delta,
 					 scheduler_tick_max_deferment());
 		}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode
@ 2015-05-15 22:17       ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-05-15 22:17 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc,
	linux-api, linux-kernel

On Fri, 15 May 2015, Chris Metcalf wrote:
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending.  Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.

And why are we not preventing that situation in the first place? The
scheduler should be able to figure that out easily..

> +  Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)
> +{
> +	struct clock_event_device *dev =
> +		__this_cpu_read(tick_cpu_device.evtdev);
> +	struct task_struct *task = current;
> +	unsigned long start = jiffies;
> +	bool warned = false;
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {

What's the ACCESS_ONCE for?

> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
> +			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
> +				task->comm, task->pid, smp_processor_id(),
> +				(jiffies - start));

What additional value has the jiffies delta over a plain human
readable '5sec' ?

> +			warned = true;
> +		}
> +		if (should_resched())
> +			schedule();
> +		if (test_thread_flag(TIF_SIGPENDING))
> +			break;
> +
> +		/* Idle with interrupts enabled and wait for the tick. */
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		arch_cpu_idle();

Oh NO! Not another variant of fake idle task. The idle implementations
can call into code which rightfully expects that the CPU is actually
IDLE.

I wasted enough time already debugging the resulting wreckage. Feel
free to use it for experimental purposes, but this is not going
anywhere near to a mainline kernel.

I completely understand WHY you want to do that, but we need proper
mechanisms for that and not some duct tape engineering band aids which
will create hard to debug side effects.

Hint: It's a scheduler job to make sure that the machine has quiesced
      _BEFORE_ letting the magic task off to user land.

> +		set_current_state(TASK_RUNNING);
> +	}
> +	if (warned) {
> +		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
> +			task->comm, task->pid, smp_processor_id(),
> +			(jiffies - start));
> +		dump_stack();

And that dump_stack() tells us which important information?

    	 tick_nohz_cpu_isolated_enter
	 context_tracking_enter
	 context_tracking_user_enter
	 arch_return_to_user_code

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-15 18:44                       ` Mike Galbraith
@ 2015-05-26 19:51                         ` Chris Metcalf
  2015-05-27  3:28                           ` Mike Galbraith
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-05-26 19:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	linux-kernel

Thanks for the clarification, and sorry for the slow reply; I had a busy
week of meetings last week, and then the long weekend in the U.S.

On 05/15/2015 02:44 PM, Mike Galbraith wrote:
> Just because the nohz_full feature itself is currently static is no
> reason to put users thereof in a straight jacket by mandating that any
> set they define irrevocably disappears from the generic resource pool.
> Those CPUs are useful until the moment someone cripples them, which
> making nohz_full imply isolcpus does if isolcpus then also becomes
> immutable, which Rik's patch does.  Making nohz_full imply isolcpus
> sounds perfectly fine until someone comes along and makes isolcpus
> immutable (Rik's patch), at which point the user loses a choice due to
> two people making it imply things that _alone_ sound perfectly fine.
>
> See what I'm saying now?

That does make sense; my argument was that 99% of the time when
someone specifies nohz_full they also need isolcpus.  You're right
that someone playing with nohz_full would be unpleasantly surprised.
And of course having more flexibility always feels like a plus.
On balance I suspect it's still better to make command line arguments
handle the common cases most succinctly.

Hopefully we'll get to a point where all of this is dynamic and how
we play with the boot arguments no longer matters.  If not, perhaps
we revisit this and make a cpu_isolation=1-15 type command line
argument that enables isolcpus and nohz_full both.
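
Concretely, with today's arguments that means booting with something
like the following (cpu list purely illustrative):

  nohz_full=1-15 isolcpus=1-15

where the hypothetical combined argument above would collapse the two
into one:

  cpu_isolation=1-15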

>>> Thomas has nuked the hrtimer softirq.
>> Yes, this I didn't know.  So I will drop my "no ksoftirqd" patch and
>> we will see if ksoftirqs emerge as an issue for my "cpu isolation"
>> stuff in the future; it may be that that was the only issue.
>>
>>> Inlining softirqs may save a context switch, but adds cycles that we may
>>> consume at higher frequency than the thing we're avoiding.
>> Yes but consuming cycles is not nearly as much of a concern
>> as avoiding interrupts or scheduling, certainly for the case of
>> userspace drivers that I described above.
> If you're raising softirqs in an SMP kernel, you're also doing something
> that puts you at very serious risk of meeting the jitter monster, locks,
> and worse, sleeping locks, no?

The softirqs were being raised by third parties for hrtimer, not by
the application code itself, if I remember correctly.  In any case,
this no longer appears to be an issue for nohz_full.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6] support "dataplane" mode for nohz_full
  2015-05-26 19:51                         ` Chris Metcalf
@ 2015-05-27  3:28                           ` Mike Galbraith
  0 siblings, 0 replies; 340+ messages in thread
From: Mike Galbraith @ 2015-05-27  3:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	linux-kernel

On Tue, 2015-05-26 at 15:51 -0400, Chris Metcalf wrote:

> On balance I suspect it's still better to make command line arguments
> handle the common cases most succinctly.

I prefer that the user specify precisely, but yeah, that entails more typing.

Idle curiosity: can an SGI monster from hell boot a NO_HZ_FULL_ALL kernel,
w/wo it implying isolcpus?  Readers having same and a reactor to power
it in their basement, please test.

	-Mike


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-05-15 22:17       ` Thomas Gleixner
@ 2015-05-28 20:38         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-28 20:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc,
	linux-api, linux-kernel

Thomas, thanks for the feedback.  My reply was delayed by being in meetings
all last week and then catching up this week - sorry about that.

On 05/15/2015 06:17 PM, Thomas Gleixner wrote:
> On Fri, 15 May 2015, Chris Metcalf wrote:
>> +/*
>> + * We normally return immediately to userspace.
>> + *
>> + * In "cpu_isolated" mode we wait until no more interrupts are
>> + * pending.  Otherwise we nap with interrupts enabled and wait for the
>> + * next interrupt to fire, then loop back and retry.
>> + *
>> + * Note that if you schedule two "cpu_isolated" processes on the same
>> + * core, neither will ever leave the kernel, and one will have to be
>> + * killed manually.
> And why are we not preventing that situation in the first place? The
> scheduler should be able to figure that out easily..

This is an interesting observation.  My instinct is that adding tests in the
scheduler costs time on a hot path for all processes, and I'm trying to
avoid adding cost where we don't need it.  It's pretty much a straight-up
application bug if two threads or processes explicitly request the
cpu_isolated semantics, and then explicitly schedule themselves onto
the same core, so my preference was to let the application writer
identify and fix the problem if it comes up.

However, I'm certainly open to thinking about checking for this failure
mode in the scheduler, though I don't know enough about the
scheduler to immediately identify where such a change might go.
Would it be appropriate to think about this as a follow-on patch, if it's
determined that the cost of testing for this condition is worth it?

>> +  Otherwise in situations where another process is
>> + * in the runqueue on this cpu, this task will just wait for that
>> + * other task to go idle before returning to user space.
>> + */
>> +void tick_nohz_cpu_isolated_enter(void)
>> +{
>> +	struct clock_event_device *dev =
>> +		__this_cpu_read(tick_cpu_device.evtdev);
>> +	struct task_struct *task = current;
>> +	unsigned long start = jiffies;
>> +	bool warned = false;
>> +
>> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
>> +	lru_add_drain();
>> +
>> +	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> What's the ACCESS_ONCE for?

We are technically in a loop here where we are waiting for an
interrupt handler to update dev->next_event.tv64, so I felt it was
appropriate to flag it as such.  If we didn't have function calls inside
the loop, the compiler would be free to hoist the load out of the loop
and spin forever on a stale value.

But it's just a style thing, and we can certainly drop it if it seems
confusing.  In any case I've changed it to READ_ONCE() since
that's preferred now anyway; this code was originally written
a while ago.
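
As a minimal illustration of the concern (the function and variable
names here are made up, not the actual patch code):

  /* Updated from the timer interrupt handler. */
  static int tick_pending;

  void wait_for_quiesce(void)
  {
          /*
           * Without READ_ONCE() the compiler may load tick_pending
           * once, hoist it out of the loop, and spin forever on the
           * stale value.
           */
          while (READ_ONCE(tick_pending))
                  cpu_relax();
  }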

>> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
>> +			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
>> +				task->comm, task->pid, smp_processor_id(),
>> +				(jiffies - start));
> What additional value has the jiffies delta over a plain human
> readable '5sec' ?

Good point.  I've changed it to emit a value in seconds.

>> +			warned = true;
>> +		}
>> +		if (should_resched())
>> +			schedule();
>> +		if (test_thread_flag(TIF_SIGPENDING))
>> +			break;
>> +
>> +		/* Idle with interrupts enabled and wait for the tick. */
>> +		set_current_state(TASK_INTERRUPTIBLE);
>> +		arch_cpu_idle();
> Oh NO! Not another variant of fake idle task. The idle implementations
> can call into code which rightfully expects that the CPU is actually
> IDLE.
>
> I wasted enough time already debugging the resulting wreckage. Feel
> free to use it for experimental purposes, but this is not going
> anywhere near to a mainline kernel.
>
> I completely understand WHY you want to do that, but we need proper
> mechanisms for that and not some duct tape engineering band aids which
> will create hard to debug side effects.

Yes, I worried about that a little when I put it in.  In particular it's
certainly true that arch_cpu_idle() isn't necessarily designed to
behave properly in this context, even if it may do the right thing
somewhat by accident.

In fact, we don't need the cpu-idling semantics in this loop;
the loop can spin quite happily waiting for next_event in the
tick_cpu_device to be cleared to KTIME_MAX (or for a signal or scheduling
request to occur).  I've changed the code to make it opt-in, so
that a weak no-op function that just calls cpu_relax() can be
replaced by an architecture-defined function that safely waits
until an interrupt is delivered, reducing the number of times we
spin around in the outer loop.

> Hint: It's a scheduler job to make sure that the machine has quiesced
>        _BEFORE_ letting the magic task off to user land.

This is not so clear to me.  There may, for example, be RCU events
that occur after the scheduler is done with its part, that still require
another timer tick on the cpu to finish quiescing RCU.  I think we
need to check for the timer-quiesced state as late as possible to
handle things like this.

Arguably the scheduler could also try to do the right thing with
a cpu_isolated task, but again, this feels like time spent in the
scheduler hot path that affects the non-cpu_isolated tasks.  For
cpu_isolated tasks they should be the only thing that's runnable
on the core 99.999% of the time, or you've done something quite
wrong anyway.

>> +		set_current_state(TASK_RUNNING);
>> +	}
>> +	if (warned) {
>> +		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
>> +			task->comm, task->pid, smp_processor_id(),
>> +			(jiffies - start));
>> +		dump_stack();
> And that dump_stack() tells us which important information?
>
>      	 tick_nohz_cpu_isolated_enter
> 	 context_tracking_enter
> 	 context_tracking_user_enter
> 	 arch_return_to_user_code

For tile, the dump_stack() includes the register state, which includes
the interrupt type that took us into the kernel, which might be helpful.
That said, I'm certainly willing to remove it, or make it call a weak
no-op function where architectures can add more info if they have it.

Thanks again!  I'll put out v3 of the patch series shortly, with changes
from your comments incorporated.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full
@ 2015-06-03 15:29     ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2,
in turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also, without
that commit the 1Hz fallback tick is always pending, so cpu_isolated
threads would never re-enter userspace.

Chris Metcalf (5):
  nohz_full: add support for "cpu_isolated" mode
  nohz: support PR_CPU_ISOLATED_STRICT mode
  nohz: cpu_isolated strict mode configurable signal
  nohz: add cpu_isolated_debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

 Documentation/kernel-parameters.txt |   6 +++
 arch/tile/kernel/process.c          |   9 ++++
 arch/tile/kernel/ptrace.c           |   6 ++-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 ++--
 include/linux/sched.h               |   3 ++
 include/linux/tick.h                |  28 ++++++++++
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |  12 +++--
 kernel/irq_work.c                   |   4 +-
 kernel/sched/core.c                 |  18 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   6 +++
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            | 104 +++++++++++++++++++++++++++++++++++-
 17 files changed, 229 insertions(+), 10 deletions(-)

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode
@ 2015-06-03 15:29       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl().  When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken.  First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.  Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, syscalls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.
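
A minimal sketch of the intended usage from userspace (assuming the
prctl values from this patch are not yet in the system headers):

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_CPU_ISOLATED
  # define PR_SET_CPU_ISOLATED    47
  # define PR_CPU_ISOLATED_ENABLE (1 << 0)
  #endif

  int main(void)
  {
          /* Assumes the task is already bound to a nohz_full core. */
          if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0))
                  perror("prctl(PR_SET_CPU_ISOLATED)");

          /* ... userspace-only main loop, e.g. polling a device ring ... */
          return 0;
  }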

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c |  9 ++++++++
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++
 include/uapi/linux/prctl.h |  5 ++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 +++++++
 kernel/time/tick-sched.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..e20c3f4a6a82 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int	cpu_isolated_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
 	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (tick_nohz_is_cpu_isolated())
+					tick_nohz_cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f6236b66788f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>
 
 #include <asm/irq_regs.h>
 
@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		tick_nohz_cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
  2015-06-03 15:29     ` Chris Metcalf
@ 2015-06-03 15:29       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is ignored to allow clearing the bit again later,
and exit/exit_group are ignored to allow exiting the task without
a pointless signal killing you as you try to do so.

This change adds the syscall-detection hooks only for x86 and tile;
I am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly, separately from other kernel entries.
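
For illustration, a userspace sketch of the strict-mode lifecycle
(based on the flags and exemptions described above, not code from this
patch):

  /* Any kernel entry after this point is fatal (SIGKILL). */
  prctl(PR_SET_CPU_ISOLATED,
        PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT, 0, 0, 0);

  /* ... userspace-only work ... */

  /* prctl() itself is exempted, so strict mode can be left again: */
  prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);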

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/ptrace.c        |  6 +++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/tick.h             | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(
+				regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/tick.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (tick_nohz_cpu_isolated_strict())
+				tick_nohz_cpu_isolated_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
 		__tick_nohz_task_switch(tsk);
 }
 
+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->cpu_isolated_flags &
+	     (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+	    (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_CPU_ISOLATED	47
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+# define PR_CPU_ISOLATED_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
 	}
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f6236b66788f..ce3bcf29a0f6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
 #include <linux/swap.h>
 
 #include <asm/irq_regs.h>
+#include <asm/unistd.h>
 
 #include "tick-internal.h"
 
@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
+static void kill_cpu_isolated_strict_task(void)
+{
+	dump_stack();
+	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_cpu_isolated_strict_task();
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal
@ 2015-06-03 15:29       ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also indicate
in the si_code field of the siginfo whether the interruption was
from a syscall.
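
For illustration (not part of this patch), a task could arrange to
survive a violation with a catchable signal; sigusr1_handler and
set_strict_signal are hypothetical names, and the PR_* macros are
the ones defined by this series:

	#include <signal.h>
	#include <sys/prctl.h>

	static void sigusr1_handler(int sig, siginfo_t *info, void *uc)
	{
		/* si_code is 1 for a syscall, 0 for another synchronous
		 * entry (e.g. a page fault).  The kernel has already
		 * cleared PR_CPU_ISOLATED_ENABLE for this task. */
	}

	static void set_strict_signal(void)
	{
		struct sigaction sa = {
			.sa_sigaction = sigusr1_handler,
			.sa_flags = SA_SIGINFO,
		};

		sigemptyset(&sa.sa_mask);
		sigaction(SIGUSR1, &sa, NULL);

		prctl(PR_SET_CPU_ISOLATED,
		      PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT |
		      PR_CPU_ISOLATED_SET_SIG(SIGUSR1),
		      0, 0, 0);
	}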

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/time/tick-sched.c   | 15 +++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
 # define PR_CPU_ISOLATED_STRICT	(1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig)  (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ce3bcf29a0f6..f09c003da22f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	dump_stack();
 	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)
 
 	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(1);
 }
 
 /*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
 {
 	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(0);
 }
 
 #endif
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v3 4/5] nohz: add cpu_isolated_debug boot flag
  2015-06-03 15:29     ` Chris Metcalf
                       ` (3 preceding siblings ...)
  (?)
@ 2015-06-03 15:29     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-kernel
  Cc: Chris Metcalf

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode.  Such processes should
get no interrupts from the kernel; if they do, and this boot flag is
specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel.  But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
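
For example, a hypothetical system reserving cores 2-7 for
cpu_isolated tasks might boot with a command line like:

	nohz_full=2-7 isolcpus=2-7 cpu_isolated_debug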

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/tick.h                |  2 ++
 kernel/irq_work.c                   |  4 +++-
 kernel/sched/core.c                 | 18 ++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  6 ++++++
 8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f6befa9855c1..2b4c89225d25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.txt.
 
+	cpu_isolated_debug	[KNL]
+			In kernels built with CONFIG_NO_HZ_FULL and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_CPU_ISOLATED_ENABLE.
+
 	cpuidle.off=1	[CPU_IDLE]
 			disable the cpuidle sub-system
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/tick.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		tick_nohz_cpu_isolated_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b7ffb10337ba..0b0d76106b8c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
 extern void tick_nohz_cpu_isolated_syscall(int nr);
 extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
 static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
 static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		tick_nohz_cpu_isolated_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a82cbb6..7315e7272e94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,24 @@ bool sched_can_stop_tick(void)
 
 	return true;
 }
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+	cpu_isolated_debug = true;
+	return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+	if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+		pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+		dump_stack();
+	}
+}
 #endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..1a810ac2656e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_NO_HZ_FULL
+	/* If the task is being killed, don't complain about cpu_isolated. */
+	if (state & TASK_WAKEKILL)
+		t->cpu_isolated_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/tick.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	tick_nohz_cpu_isolated_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		tick_nohz_cpu_isolated_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
 
@@ -335,6 +336,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		tick_nohz_cpu_isolated_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v3 5/5] nohz: cpu_isolated: allow tick to be fully disabled
  2015-06-03 15:29     ` Chris Metcalf
                       ` (4 preceding siblings ...)
  (?)
@ 2015-06-03 15:29     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-kernel
  Cc: Chris Metcalf

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, due to the way such processes quiesce by waiting for
the timer tick to stop prior to returning to userspace, without this
commit it won't be possible to use the cpu_isolated mode at all.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted.  Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs.  However it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't
silently forgive such assumptions, so it may also be worth it from
that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008) and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.).  So these semantics are
very useful if we can convince ourselves that doing this is safe.

Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f09c003da22f..ec36ed00af9d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -733,7 +733,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 		}
 
 #ifdef CONFIG_NO_HZ_FULL
-		if (!ts->inidle) {
+		if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
 			time_delta = min(time_delta,
 					 scheduler_tick_max_deferment());
 		}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full
  2015-06-03 15:29     ` Chris Metcalf
@ 2015-07-13 19:57       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

This posting of the series is basically a "ping" since there were
no comments to the v3 version.  I have rebased it to 4.2-rc1, added
support for arm64 syscall tracking for "strict" mode, and retested it;
are there any remaining concerns?  Thomas, I haven't heard from you
whether my removal of the cpu_idle calls sufficiently addresses your
concerns about that aspect.  Are there other concerns with this patch
series at this point?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  A prctl()
option (PR_SET_CPU_ISOLATED) is added to control whether processes have
requested this stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v4:
  rebased on kernel v4.2-rc1
  added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that
thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also,
if we drop that commit the 1Hz tick remains enabled, and
cpu_isolated threads will never re-enter userspace since a tick will
always be pending.

Chris Metcalf (5):
  nohz_full: add support for "cpu_isolated" mode
  nohz: support PR_CPU_ISOLATED_STRICT mode
  nohz: cpu_isolated strict mode configurable signal
  nohz: add cpu_isolated_debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

 Documentation/kernel-parameters.txt |   6 +++
 arch/tile/kernel/process.c          |   9 ++++
 arch/tile/kernel/ptrace.c           |   6 ++-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 ++--
 include/linux/sched.h               |   3 ++
 include/linux/tick.h                |  28 ++++++++++
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |  12 +++--
 kernel/irq_work.c                   |   4 +-
 kernel/sched/core.c                 |  18 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   6 +++
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            | 104 +++++++++++++++++++++++++++++++++++-
 17 files changed, 229 insertions(+), 10 deletions(-)

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 19:57       ` Chris Metcalf
@ 2015-07-13 19:57         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl().  When the
_ENABLE bit is set for a task, and it is returning to userspace
on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken.  First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.  Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, syscalls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.
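
As an illustrative sketch of the intended usage (not part of this
patch; enter_cpu_isolated is a hypothetical helper, and the PR_*
constants come from the patched uapi header), a task pins itself to
a nohz_full core and then enables the mode:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <sys/prctl.h>

	static int enter_cpu_isolated(int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);	/* must be a nohz_full core */
		if (sched_setaffinity(0, sizeof(set), &set) != 0)
			return -1;

		/* On the next return to userspace the task drains the
		 * LRU pagevecs and waits for the tick to stop. */
		return prctl(PR_SET_CPU_ISOLATED,
			     PR_CPU_ISOLATED_ENABLE, 0, 0, 0);
	}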

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c |  9 ++++++++
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++
 include/uapi/linux/prctl.h |  5 ++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 +++++++
 kernel/time/tick-sched.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..3625e839ad62 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..f350b0c20bbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,6 +1778,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int	cpu_isolated_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3741ba1a652c..cb5569181359 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 		cpumask_or(mask, mask, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..f9de3ee12723 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (tick_nohz_is_cpu_isolated())
+					tick_nohz_cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..36eb9a839f1f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..4cf093c012d1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>
 
 #include <asm/irq_regs.h>
 
@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		tick_nohz_cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
  2015-07-13 19:57       ` Chris Metcalf
@ 2015-07-13 19:57         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular, if it
enters the kernel via system call, page fault, or any of a number of other
synchronous traps, it may be unexpectedly exposed to long latencies.
Add a simple flag that puts the process into a state where any such
kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any exception_enter().
The prctl() syscall is ignored to allow clearing the bit again later,
and exit/exit_group are ignored so that the task can exit without
a pointless signal being delivered on the way out.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we are, in fact, exiting back to user space, so that we can properly
track user exceptions separately from other kernel entries.
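
As an illustration of the intended usage (a sketch, not part of this
patch; it assumes the task has already been affinitized to a
nohz_full/isolcpus core, and the constants simply mirror the
<linux/prctl.h> definitions added by this series):

	#include <err.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_CPU_ISOLATED
	#define PR_SET_CPU_ISOLATED	47
	#define PR_CPU_ISOLATED_ENABLE	(1 << 0)
	#define PR_CPU_ISOLATED_STRICT	(1 << 1)
	#endif

	int main(void)
	{
		/* Fails with EINVAL if the kernel lacks CONFIG_NO_HZ_FULL. */
		if (prctl(PR_SET_CPU_ISOLATED,
			  PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT,
			  0, 0, 0) != 0)
			err(1, "PR_SET_CPU_ISOLATED");

		for (;;) {
			/* Pure userspace fast path: from here on, any
			 * syscall or synchronous trap is fatal, other
			 * than prctl() and exit()/exit_group(). */
		}
	}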

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/ptrace.c       |  4 ++++
 arch/tile/kernel/ptrace.c        |  6 +++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/tick.h             | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..7315b1579cbd 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	/* Ensure we report cpu_isolated violations in all circumstances. */
+	if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
+		tick_nohz_cpu_isolated_syscall(regs->syscallno);
+
 	/* Do the secure computing check first; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(
+				regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..860f346977e2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..8b994e2a0330 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/tick.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (tick_nohz_cpu_isolated_strict())
+				tick_nohz_cpu_isolated_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index cb5569181359..f79f6945f762 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -157,6 +157,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -168,6 +170,8 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
@@ -200,4 +204,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
 		__tick_nohz_task_switch(tsk);
 }
 
+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->cpu_isolated_flags &
+	     (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+	    (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_CPU_ISOLATED	47
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+# define PR_CPU_ISOLATED_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index f9de3ee12723..fd051ea290ee 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4cf093c012d1..9f495c7c7dc2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
 #include <linux/swap.h>
 
 #include <asm/irq_regs.h>
+#include <asm/unistd.h>
 
 #include "tick-internal.h"
 
@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
+static void kill_cpu_isolated_strict_task(void)
+{
+	dump_stack();
+	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_cpu_isolated_strict_task();
+}
+
 #endif
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal
  2015-07-13 19:57       ` Chris Metcalf
@ 2015-07-13 19:57         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
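
For illustration (a sketch, not code from this patch), a task that
prefers a catchable SIGUSR1 over the default SIGKILL could request:

	unsigned long flags = PR_CPU_ISOLATED_ENABLE |
			      PR_CPU_ISOLATED_STRICT |
			      PR_CPU_ISOLATED_SET_SIG(SIGUSR1);

	prctl(PR_SET_CPU_ISOLATED, flags, 0, 0, 0);

Its SIGUSR1 handler (installed with SA_SIGINFO) can then check
siginfo->si_code: 1 means the strict-mode violation was a syscall,
0 means some other synchronous exception.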

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/time/tick-sched.c   | 15 +++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
 # define PR_CPU_ISOLATED_STRICT	(1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig)  (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f495c7c7dc2..c5eca9c99fad 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
-static void kill_cpu_isolated_strict_task(void)
+static void kill_cpu_isolated_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	dump_stack();
 	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall)
 
 	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(1);
 }
 
 /*
@@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void)
 {
 	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(0);
 }
 
 #endif
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v4 4/5] nohz: add cpu_isolated_debug boot flag
  2015-07-13 19:57       ` Chris Metcalf
                         ` (3 preceding siblings ...)
  (?)
@ 2015-07-13 19:58       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, linux-kernel
  Cc: Chris Metcalf

This flag simplifies debugging of NO_HZ_FULL kernels when processes
are running in PR_CPU_ISOLATED_ENABLE mode.  Such processes should
receive no interrupts from the kernel; if they do, and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated core
has unexpectedly entered the kernel.  But what this boot flag does
is allow the kernel to provide better diagnostics, e.g. by reporting
in the IPI-generating code what remote core and context is preparing
to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
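
For example, a debugging session on a machine whose application cores
are cpus 1-7 (the cpu list here is purely illustrative) might boot
with:

	nohz_full=1-7 isolcpus=1-7 cpu_isolated_debug

and then watch the console for the backtraces that are generated
whenever the kernel is about to interrupt a task that has requested
PR_CPU_ISOLATED_ENABLE.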

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/tick.h                |  2 ++
 kernel/irq_work.c                   |  4 +++-
 kernel/sched/core.c                 | 18 ++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  6 ++++++
 8 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..76e8e2ff4a0a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.txt.
 
+	cpu_isolated_debug	[KNL]
+			In kernels built with CONFIG_NO_HZ_FULL and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_CPU_ISOLATED_ENABLE.
+
 	cpuidle.off=1	[CPU_IDLE]
 			disable the cpuidle sub-system
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..f336880e1b01 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/tick.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		tick_nohz_cpu_isolated_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f79f6945f762..ed65551e2315 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -159,6 +159,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
 extern void tick_nohz_cpu_isolated_syscall(int nr);
 extern void tick_nohz_cpu_isolated_exception(void);
+extern void tick_nohz_cpu_isolated_debug(int cpu);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -172,6 +173,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
 static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
 static inline void tick_nohz_cpu_isolated_exception(void) { }
+static inline void tick_nohz_cpu_isolated_debug(int cpu) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..7f35c90346de 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		tick_nohz_cpu_isolated_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..c8388f9206b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -743,6 +743,24 @@ bool sched_can_stop_tick(void)
 
 	return true;
 }
+
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug;
+static int __init cpu_isolated_debug_func(char *str)
+{
+	cpu_isolated_debug = true;
+	return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void tick_nohz_cpu_isolated_debug(int cpu)
+{
+	if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+		pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+		dump_stack();
+	}
+}
 #endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_NO_HZ_FULL
+	/* If the task is being killed, don't complain about cpu_isolated. */
+	if (state & TASK_WAKEKILL)
+		t->cpu_isolated_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..6b7d8e2c8af4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/tick.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	tick_nohz_cpu_isolated_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		tick_nohz_cpu_isolated_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..333872925ff6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
 
@@ -335,6 +336,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		tick_nohz_cpu_isolated_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v4 5/5] nohz: cpu_isolated: allow tick to be fully disabled
  2015-07-13 19:57       ` Chris Metcalf
                         ` (4 preceding siblings ...)
  (?)
@ 2015-07-13 19:58       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 19:58 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-kernel
  Cc: Chris Metcalf

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.
In addition, because such processes quiesce by waiting for the timer
tick to stop prior to returning to userspace, cpu_isolated mode cannot
be used at all without this commit.

Removing the 1-second cap was previously discussed (see link below)
and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted.  Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred just
meant that no one would ever fix the underlying bugs.  However it's at
least true that the mode proposed in this patch can only be enabled on an
isolcpus core by a process requesting cpu_isolated mode, which may limit
how important it is to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz fallback
timer is removed, we create an environment where new code that relies
on that tick will get punished, and we won't silently forgive such
assumptions, so it may also be worth it from that perspective.

Finally, it's worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2008), and customers are very enthusiastic about the resulting bare-metal
performance on cores that are available to run full Linux semantics
on demand (crash, logging, shutdown, etc.).  So these semantics are very
useful if we can convince ourselves that doing this is safe.
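
As a rough way to observe the effect (illustrative; "LOC" is the local
timer interrupt line on x86, and other architectures name it
differently), pin a cpu_isolated task to a nohz_full core and confirm
that the core's timer interrupt count stops advancing:

	# watch -n1 'grep -w LOC /proc/interrupts'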

Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c5eca9c99fad..8187b4b4c91c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -754,7 +754,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 
 #ifdef CONFIG_NO_HZ_FULL
 	/* Limit the tick delta to the maximum scheduler deferment */
-	if (!ts->inidle)
+	if (!ts->inidle && !tick_nohz_is_cpu_isolated())
 		delta = min(delta, scheduler_tick_max_deferment());
 #endif
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 19:57         ` Chris Metcalf
  (?)
@ 2015-07-13 20:40         ` Andy Lutomirski
  2015-07-13 21:01           ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-07-13 20:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
>
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

I thought the general consensus was that this should be the default
behavior and that any associated bugs should be fixed.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 20:40         ` Andy Lutomirski
@ 2015-07-13 21:01           ` Chris Metcalf
  2015-07-13 21:45             ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-07-13 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel

On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> The existing nohz_full mode makes tradeoffs to minimize userspace
>> interruptions while still attempting to avoid overheads in the
>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>
>> However, some applications require a stronger commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications to elect
>> to have the stronger semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> I thought the general consensus was that this should be the default
> behavior and that any associated bugs should be fixed.

I think it comes down to dividing the set of use cases in two:

- "Regular" nohz_full, as used to improve performance and limit
   interruptions, possibly for power benefits, etc.  But, stray
   interrupts are not particularly bad, and you don't want to take
   extreme measures to avoid them.

- What I'm calling "cpu_isolated" mode where when you return to
   userspace, you expect that by God, the kernel doesn't interrupt you
   again, and if it does, it's a flat-out bug.

There are a few things that cpu_isolated mode currently does to
accomplish its goals that are pretty heavy-weight:

Processes are held in kernel space until ticks are quiesced; this is
not necessarily what every nohz_full task wants.  If a task makes a
kernel call, there may well be arbitrary timer fallout, and having a
way to select whether or not you are willing to take a timer tick after
return to userspace is pretty important.

Likewise, there are things that you may want to do on return to
userspace that are designed to prevent further interruptions in
cpu_isolated mode, even at a possible future performance cost if and
when you return to the kernel, such as flushing the per-cpu free page
list so that you won't be interrupted by an IPI to flush it later.

If you're arguing that the cpu_isolated semantic is really the only
one that makes sense for nohz_full, my sense is that it might be
surprising to many of the folks who do nohz_full work.  But, I'm happy
to be wrong on this point, and maybe all the nohz_full community is
interested in making the same tradeoffs for nohz_full generally that
I've proposed in this patch series just for cpu_isolated?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 21:01           ` Chris Metcalf
@ 2015-07-13 21:45             ` Andy Lutomirski
  2015-07-21 19:10               ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-07-13 21:45 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel

On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>> interruptions while still attempting to avoid overheads in the
>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>
>>> However, some applications require a stronger commitment from the
>>> kernel to avoid interruptions, in particular userspace device
>>> driver style applications, such as high-speed networking code.
>>>
>>> This change introduces a framework to allow applications to elect
>>> to have the stronger semantics as needed, specifying
>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>> Subsequent commits will add additional flags and additional
>>> semantics.
>>
>> I thought the general consensus was that this should be the default
>> behavior and that any associated bugs should be fixed.
>
>
> I think it comes down to dividing the set of use cases in two:
>
> - "Regular" nohz_full, as used to improve performance and limit
>   interruptions, possibly for power benefits, etc.  But, stray
>   interrupts are not particularly bad, and you don't want to take
>   extreme measures to avoid them.
>
> - What I'm calling "cpu_isolated" mode where when you return to
>   userspace, you expect that by God, the kernel doesn't interrupt you
>   again, and if it does, it's a flat-out bug.
>
> There are a few things that cpu_isolated mode currently does to
> accomplish its goals that are pretty heavy-weight:
>
> Processes are held in kernel space until ticks are quiesced; this is
> not necessarily what every nohz_full task wants.  If a task makes a
> kernel call, there may well be arbitrary timer fallout, and having a
> way to select whether or not you are willing to take a timer tick after
> return to userspace is pretty important.

Then shouldn't deferred work be done immediately in nohz_full mode
regardless?  What is this delayed work that's being done?

>
> Likewise, there are things that you may want to do on return to
> userspace that are designed to prevent further interruptions in
> cpu_isolated mode, even at a possible future performance cost if and
> when you return to the kernel, such as flushing the per-cpu free page
> list so that you won't be interrupted by an IPI to flush it later.
>

Why not just kick the per-cpu free page over to whatever cpu is
monitoring your RCU state, etc?  That should be very quick.

> If you're arguing that the cpu_isolated semantic is really the only
> one that makes sense for nohz_full, my sense is that it might be
> surprising to many of the folks who do nohz_full work.  But, I'm happy
> to be wrong on this point, and maybe all the nohz_full community is
> interested in making the same tradeoffs for nohz_full generally that
> I've proposed in this patch series just for cpu_isolated?

nohz_full is currently dog slow for no particularly good reasons.  I
suspect that the interrupts you're seeing are also there for no
particularly good reasons as well.

Let's fix them instead of adding new ABIs to work around them.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
@ 2015-07-13 21:47           ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-07-13 21:47 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> With cpu_isolated mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves.  In particular, if it
> enters the kernel via system call, page fault, or any of a number of other
> synchronous traps, it may be unexpectedly exposed to long latencies.
> Add a simple flag that puts the process into a state where any such
> kernel entry is fatal.
>

To me, this seems like the wrong design.  If nothing else, it seems
too much like an abusable anti-debugging mechanism.  I can imagine
some per-task flag "I think I shouldn't be interrupted now" and a
tracepoint that fires if the task is interrupted with that flag set.
But the strong cpu isolation stuff requires systemwide configuration,
and I think that monitoring that it works should work similarly.

More comments below.

> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/arm64/kernel/ptrace.c       |  4 ++++
>  arch/tile/kernel/ptrace.c        |  6 +++++-
>  arch/x86/kernel/ptrace.c         |  2 ++
>  include/linux/context_tracking.h | 11 ++++++++---
>  include/linux/tick.h             | 16 ++++++++++++++++
>  include/uapi/linux/prctl.h       |  1 +
>  kernel/context_tracking.c        |  9 ++++++---
>  kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
>  8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..7315b1579cbd 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
>  asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>  {
> +       /* Ensure we report cpu_isolated violations in all circumstances. */
> +       if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
> +               tick_nohz_cpu_isolated_syscall(regs->syscallno);

IMO this is pointless.  If a user wants a syscall to kill them, use
seccomp.  The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.
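
(For reference, seccomp strict mode is a single call:

	/* needs <sys/prctl.h> and <linux/seccomp.h> */
	prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

after which any syscall other than read(), write(), _exit(), and
sigreturn() delivers SIGKILL; a seccomp-bpf filter would be needed to
allow a different syscall set.)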


> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>                 return 0;
>
>         prev_ctx = this_cpu_read(context_tracking.state);
> -       if (prev_ctx != CONTEXT_KERNEL)
> -               context_tracking_exit(prev_ctx);
> +       if (prev_ctx != CONTEXT_KERNEL) {
> +               if (context_tracking_exit(prev_ctx)) {
> +                       if (tick_nohz_cpu_isolated_strict())
> +                               tick_nohz_cpu_isolated_exception();
> +               }
> +       }

NACK.  I'm cautiously optimistic that an x86 kernel 4.3 or newer will
simply never call exception_enter.  It certainly won't call it
frequently unless something goes wrong with the patches that are
already in -tip.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>   * This call supports re-entrancy. This way it can be called from any exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)
>  {
>         unsigned long flags;
> +       bool from_user = false;
>

IMO the internal context tracking APIs (e.g. context_tracking_exit) are
mostly of the form "hey context tracking: I don't really know what
you're doing or what I'm doing, but let me call you and make both of
us feel better."  You're making it somewhat worse: now it's all of the
above plus "I don't even know whether I just entered the kernel --
maybe you have a better idea".

Starting with 4.3, x86 kernels will know *exactly* when they enter the
kernel.  All of this context tracking what-was-my-previous-state stuff
will remain until someone kills it, but when it goes away we'll get a
nice performance boost.

So, no, let's implement this for real if we're going to implement it.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 21:45             ` Andy Lutomirski
@ 2015-07-21 19:10               ` Chris Metcalf
  2015-07-21 19:26                 ` Andy Lutomirski
  2015-07-24 14:03                 ` Frederic Weisbecker
  0 siblings, 2 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-21 19:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel

Sorry for the delay in responding; some other priorities came up internally.

On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com>
>>> wrote:
>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>> interruptions while still attempting to avoid overheads in the
>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>
>>>> However, some applications require a stronger commitment from the
>>>> kernel to avoid interruptions, in particular userspace device
>>>> driver style applications, such as high-speed networking code.
>>>>
>>>> This change introduces a framework to allow applications to elect
>>>> to have the stronger semantics as needed, specifying
>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>> Subsequent commits will add additional flags and additional
>>>> semantics.
>>> I thought the general consensus was that this should be the default
>>> behavior and that any associated bugs should be fixed.
>>
>> I think it comes down to dividing the set of use cases in two:
>>
>> - "Regular" nohz_full, as used to improve performance and limit
>>    interruptions, possibly for power benefits, etc.  But, stray
>>    interrupts are not particularly bad, and you don't want to take
>>    extreme measures to avoid them.
>>
>> - What I'm calling "cpu_isolated" mode where when you return to
>>    userspace, you expect that by God, the kernel doesn't interrupt you
>>    again, and if it does, it's a flat-out bug.
>>
>> There are a few things that cpu_isolated mode currently does to
>> accomplish its goals that are pretty heavy-weight:
>>
>> Processes are held in kernel space until ticks are quiesced; this is
>> not necessarily what every nohz_full task wants.  If a task makes a
>> kernel call, there may well be arbitrary timer fallout, and having a
>> way to select whether or not you are willing to take a timer tick after
>> return to userspace is pretty important.
> Then shouldn't deferred work be done immediately in nohz_full mode
> regardless?  What is this delayed work that's being done?

I'm thinking of things like needing to wait for an RCU quiesce
period to complete.

In the current version, there's also the vmstat_update() that
may schedule delayed work and interrupt the core again shortly,
before realizing that there are no more counter updates
happening, at which point it quiesces.
this in cpu_isolated mode simply by spinning and waiting for
the timer interrupts to complete.

>> Likewise, there are things that you may want to do on return to
>> userspace that are designed to prevent further interruptions in
>> cpu_isolated mode, even at a possible future performance cost if and
>> when you return to the kernel, such as flushing the per-cpu free page
>> list so that you won't be interrupted by an IPI to flush it later.
> Why not just kick the per-cpu free page over to whatever cpu is
> monitoring your RCU state, etc?  That should be very quick.

So just for the sake of precision, the thing I'm talking about
is the lru_add_drain() call on kernel exit.  Are you proposing
that we call that for every nohz_full core on kernel exit?
I'm not opposed to this, but I don't know if other nohz
developers feel like this is the right tradeoff.

Similarly, addressing the vmstat_update() issue above, in
cpu_isolated mode we might want to have a follow-on
patch that forces the vmstat system into quiesced state
on return to userspace.  We would need to do this
unconditionally on all nohz_full cores if we tried to combine
the current nohz_full with my proposed cpu_isolated
functionality.  Again, I'm not necessarily opposed, but
I suspect other nohz developers might not want this.

(I didn't want to introduce such a patch as part of this
series since it pulls in even more interested parties, and
it gets harder and harder to get to consensus.)

>> If you're arguing that the cpu_isolated semantic is really the only
>> one that makes sense for nohz_full, my sense is that it might be
>> surprising to many of the folks who do nohz_full work.  But, I'm happy
>> to be wrong on this point, and maybe all the nohz_full community is
>> interested in making the same tradeoffs for nohz_full generally that
>> I've proposed in this patch series just for cpu_isolated?
> nohz_full is currently dog slow for no particularly good reasons.  I
> suspect that the interrupts you're seeing are also there for no
> particularly good reasons as well.
>
> Let's fix them instead of adding new ABIs to work around them.

Well, in principle if we accepted my proposed patch series
and then over time came to decide that it was reasonable
for nohz_full to have these complete cpu isolation
semantics, the one proposed ABI simply becomes a no-op.
So it's not as problematic an ABI as some.
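
For reference, the whole proposed ABI from the application's side is a
single prctl() pair.  A minimal sketch (the constants are the values
proposed in this series; define them locally until they land in the
uapi headers):

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_CPU_ISOLATED
	#define PR_SET_CPU_ISOLATED	47
	#define PR_GET_CPU_ISOLATED	48
	#define PR_CPU_ISOLATED_ENABLE	(1 << 0)
	#endif

	int main(void)
	{
		/* Opt in: subsequent returns to userspace on a
		 * nohz_full core will first quiesce pending ticks. */
		if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE,
			  0, 0, 0))
			perror("PR_SET_CPU_ISOLATED");
		return 0;
	}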

My issue is this: I'm totally happy with submitting a revised
patch series that does all the stuff for pure nohz_full that
I'm currently proposing for cpu_isolated.  But, is it what
the community wants?  Should I propose it and see?

Frederic, do you have any insight here?  Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-21 19:10               ` Chris Metcalf
@ 2015-07-21 19:26                 ` Andy Lutomirski
  2015-07-21 20:36                     ` Paul E. McKenney
  2015-07-24 20:22                   ` Chris Metcalf
  2015-07-24 14:03                 ` Frederic Weisbecker
  1 sibling, 2 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-07-21 19:26 UTC (permalink / raw)
  To: Chris Metcalf, Paul McKenney
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc,
	Linux API, linux-kernel

On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> Sorry for the delay in responding; some other priorities came up internally.
>
> On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com>
>>>>
>>>> wrote:
>>>>>
>>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>>> interruptions while still attempting to avoid overheads in the
>>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>>
>>>>> However, some applications require a stronger commitment from the
>>>>> kernel to avoid interruptions, in particular userspace device
>>>>> driver style applications, such as high-speed networking code.
>>>>>
>>>>> This change introduces a framework to allow applications to elect
>>>>> to have the stronger semantics as needed, specifying
>>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>>> Subsequent commits will add additional flags and additional
>>>>> semantics.
>>>>
>>>> I thought the general consensus was that this should be the default
>>>> behavior and that any associated bugs should be fixed.
>>>
>>>
>>> I think it comes down to dividing the set of use cases in two:
>>>
>>> - "Regular" nohz_full, as used to improve performance and limit
>>>    interruptions, possibly for power benefits, etc.  But, stray
>>>    interrupts are not particularly bad, and you don't want to take
>>>    extreme measures to avoid them.
>>>
>>> - What I'm calling "cpu_isolated" mode where when you return to
>>>    userspace, you expect that by God, the kernel doesn't interrupt you
>>>    again, and if it does, it's a flat-out bug.
>>>
>>> There are a few things that cpu_isolated mode currently does to
>>> accomplish its goals that are pretty heavy-weight:
>>>
>>> Processes are held in kernel space until ticks are quiesced; this is
>>> not necessarily what every nohz_full task wants.  If a task makes a
>>> kernel call, there may well be arbitrary timer fallout, and having a
>>> way to select whether or not you are willing to take a timer tick after
>>> return to userspace is pretty important.
>>
>> Then shouldn't deferred work be done immediately in nohz_full mode
>> regardless?  What is this delayed work that's being done?
>
>
> I'm thinking of things like needing to wait for an RCU quiesce
> period to complete.

rcu_nocbs does this, right?

>
> In the current version, there's also the vmstat_update() that
> may schedule delayed work and interrupt the core again
> shortly before realizing that there are no more counter updates
> happening, at which point it quiesces.  Currently we handle
> this in cpu_isolated mode simply by spinning and waiting for
> the timer interrupts to complete.

Perhaps we should fix that?

>
>>> Likewise, there are things that you may want to do on return to
>>> userspace that are designed to prevent further interruptions in
>>> cpu_isolated mode, even at a possible future performance cost if and
>>> when you return to the kernel, such as flushing the per-cpu free page
>>> list so that you won't be interrupted by an IPI to flush it later.
>>
>> Why not just kick the per-cpu free page over to whatever cpu is
>> monitoring your RCU state, etc?  That should be very quick.
>
>
> So just for the sake of precision, the thing I'm talking about
> is the lru_add_drain() call on kernel exit.  Are you proposing
> that we call that for every nohz_full core on kernel exit?
> I'm not opposed to this, but I don't know if other nohz
> developers feel like this is the right tradeoff.

I'm proposing either that we do that or that we arrange for other cpus
to be able to steal our LRU list while we're in RCU user/idle.

>> Let's fix them instead of adding new ABIs to work around them.
>
>
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.

What if we made it a debugfs thing instead of a prctl?  Have a mode
where the system tries really hard to quiesce itself even at the cost
of performance.

>
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated.  But, is it what
> the community wants?  Should I propose it and see?
>
> Frederic, do you have any insight here?  Thanks!
>
> --
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
@ 2015-07-21 19:34             ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-21 19:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> With cpu_isolated mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves.  In particular, if it
>> enters the kernel via system call, page fault, or any of a number of other
>> synchronous traps, it may be unexpectedly exposed to long latencies.
>> Add a simple flag that puts the process into a state where any such
>> kernel entry is fatal.
>>
> To me, this seems like the wrong design.  If nothing else, it seems
> too much like an abusable anti-debugging mechanism.  I can imagine
> some per-task flag "I think I shouldn't be interrupted now" and a
> tracepoint that fires if the task is interrupted with that flag set.
> But the strong cpu isolation stuff requires systemwide configuration,
> and I think that monitoring that it works should work similarly.

First, you mention a per-task flag, but not specifically whether the
proposed prctl() mechanism is a reasonable way to set that flag.
Just wanted to clarify that this wasn't an issue in and of itself for you.

Second, you suggest a tracepoint.  I'm OK with creating a tracepoint
dedicated to cpu_isolated strict failures and making that the only
way this mechanism works.  But, earlier community feedback seemed to
suggest that the signal mechanism was OK; one piece of feedback
just requested being able to set which signal was delivered.  Do you
think the signal idea is a bad one?  Are you proposing potentially
having a signal and/or a tracepoint?

Last, you mention systemwide configuration for monitoring.  Can you
expand on what you mean by that?  We already support the monitoring
only on the nohz_full cores, so to that extent it's already systemwide.
And the per-task flag has to be set by the running process when it's
ready for this state, so that can't really be systemwide configuration.
I don't understand your suggestion on this point.

>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..7315b1579cbd 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>>   asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>>   {
>> +       /* Ensure we report cpu_isolated violations in all circumstances. */
>> +       if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
>> +               tick_nohz_cpu_isolated_syscall(regs->syscallno);
> IMO this is pointless.  If a user wants a syscall to kill them, use
> seccomp.  The kernel isn't at fault if the user does a syscall when it
> didn't want to enter the kernel.

Interesting!  I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here.  One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security mechanism, but makes less sense for cpu_isolated.

So, do you think there's a good role for the seccomp() API to play
in achieving this goal?  It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience).  You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.

Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen.  It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.
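
(For comparison, entering today's strict mode from userspace is a
one-liner; seccomp(2) has no glibc wrapper, so it is the raw syscall
or the older prctl() spelling -- just a sketch:)

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/prctl.h>
	#include <linux/seccomp.h>

	/* After this, any syscall other than read/write/_exit/
	 * sigreturn delivers SIGKILL, and it cannot be undone. */
	syscall(__NR_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL);
	/* or: prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT); */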

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>                  return 0;
>>
>>          prev_ctx = this_cpu_read(context_tracking.state);
>> -       if (prev_ctx != CONTEXT_KERNEL)
>> -               context_tracking_exit(prev_ctx);
>> +       if (prev_ctx != CONTEXT_KERNEL) {
>> +               if (context_tracking_exit(prev_ctx)) {
>> +                       if (tick_nohz_cpu_isolated_strict())
>> +                               tick_nohz_cpu_isolated_exception();
>> +               }
>> +       }
> NACK.  I'm cautiously optimistic that an x86 kernel 4.3 or newer will
> simply never call exception_enter.  It certainly won't call it
> frequently unless something goes wrong with the patches that are
> already in -tip.

This is intended to catch user exceptions like page faults, GPV or
(on platforms where this would happen) unaligned data traps.
The kernel still has a role to play here and cpu_isolated mode
needs to let the user know they have accidentally entered
the kernel in this case.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>    * This call supports re-entrancy. This way it can be called from any exception
>>    * handler without needing to know if we came from userspace or not.
>>    */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
>>   {
>>          unsigned long flags;
>> +       bool from_user = false;
>>
> IMO the internal context tracking APIs (e.g. context_tracking_exit) are
> mostly of the form "hey context tracking: I don't really know what
> you're doing or what I'm doing, but let me call you and make both of
> us feel better."  You're making it somewhat worse: now it's all of the
> above plus "I don't even know whether I just entered the kernel --
> maybe you have a better idea".
>
> Starting with 4.3, x86 kernels will know *exactly* when they enter the
> kernel.  All of this context tracking what-was-my-previous-state stuff
> will remain until someone kills it, but when it goes away we'll get a
> nice performance boost.
>
> So, no, let's implement this for real if we're going to implement it.

I'm certainly OK with rebasing on top of 4.3 after the context
tracking stuff is better.  That said, I think it makes sense to continue
to debate the intent of the patch series even if we pull this one
patch out and defer it until after 4.3, or have it end up pulled
into some other repo that includes the improvements and
is being pulled for 4.3.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
@ 2015-07-21 19:42               ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-07-21 19:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>
>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> With cpu_isolated mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves.  In particular, if it
>>> enters the kernel via system call, page fault, or any of a number of
>>> other
>>> synchronous traps, it may be unexpectedly exposed to long latencies.
>>> Add a simple flag that puts the process into a state where any such
>>> kernel entry is fatal.
>>>
>> To me, this seems like the wrong design.  If nothing else, it seems
>> too much like an abusable anti-debugging mechanism.  I can imagine
>> some per-task flag "I think I shouldn't be interrupted now" and a
>> tracepoint that fires if the task is interrupted with that flag set.
>> But the strong cpu isolation stuff requires systemwide configuration,
>> and I think that monitoring that it works should work similarly.
>
>
> First, you mention a per-task flag, but not specifically whether the
> proposed prctl() mechanism is a reasonable way to set that flag.
> Just wanted to clarify that this wasn't an issue in and of itself for you.

I think I'm okay with a per-task flag for this and, if you add one,
then prctl() is presumably the way to go.  Unless people think that
nohz should be 100% reliable always, in which case we might as well
make the flag per-cpu.

>
> Second, you suggest a tracepoint.  I'm OK with creating a tracepoint
> dedicated to cpu_isolated strict failures and making that the only
> way this mechanism works.  But, earlier community feedback seemed to
> suggest that the signal mechanism was OK; one piece of feedback
> just requested being able to set which signal was delivered.  Do you
> think the signal idea is a bad one?  Are you proposing potentially
> having a signal and/or a tracepoint?

I prefer the tracepoint.  It's friendlier to debuggers, and it's
really about diagnosing a kernel problem, not a userspace problem.
Also, I really doubt that people should deploy a signal thing in
production.  What if an NMI fires and kills their realtime program?
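
Something like the following TRACE_EVENT would do (hypothetical name
and fields, just to make the suggestion concrete):

	TRACE_EVENT(cpu_isolated_violation,
		TP_PROTO(int cpu, unsigned long ip),
		TP_ARGS(cpu, ip),
		TP_STRUCT__entry(
			__field(int, cpu)
			__field(unsigned long, ip)
		),
		TP_fast_assign(
			__entry->cpu = cpu;
			__entry->ip = ip;
		),
		TP_printk("cpu %d interrupted at %pS",
			  __entry->cpu, (void *)__entry->ip)
	);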

>
> Last, you mention systemwide configuration for monitoring.  Can you
> expand on what you mean by that?  We already support the monitoring
> only on the nohz_full cores, so to that extent it's already systemwide.
> And the per-task flag has to be set by the running process when it's
> ready for this state, so that can't really be systemwide configuration.
> I don't understand your suggestion on this point.

I'm really thinking about systemwide configuration for isolation.  I
think we'll always (at least in the nearish term) need the admin's
help to set up isolated CPUs.  If the admin makes a whole CPU be
isolated, then monitoring just that CPU and monitoring it all the time
seems sensible.  If we really do think that isolating a CPU should
require a syscall of some sort because it's too expensive otherwise,
then we can do it that way, too.  And if full isolation requires some
user help (e.g. don't do certain things that break isolation), then
having a per-task monitoring flag seems reasonable.

We may always need the user's help to avoid IPIs.  For example, if one
thread calls munmap, the other thread is going to get an IPI.  There's
nothing we can do about that.

> I'm certainly OK with rebasing on top of 4.3 after the context
> tracking stuff is better.  That said, I think it makes sense to continue
> to debate the intent of the patch series even if we pull this one
> patch out and defer it until after 4.3, or having it end up pulled
> into some other repo that includes the improvements and
> is being pulled for 4.3.

Sure, no problem.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
@ 2015-07-21 20:36                     ` Paul E. McKenney
  0 siblings, 0 replies; 340+ messages in thread
From: Paul E. McKenney @ 2015-07-21 20:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel

On Tue, Jul 21, 2015 at 12:26:17PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > Sorry for the delay in responding; some other priorities came up internally.
> >
> > On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> >>
> >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com>
> >> wrote:
> >>>
> >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
> >>>>
> >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com>
> >>>>
> >>>> wrote:
> >>>>>
> >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
> >>>>> interruptions while still attempting to avoid overheads in the
> >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
> >>>>>
> >>>>> However, some applications require a stronger commitment from the
> >>>>> kernel to avoid interruptions, in particular userspace device
> >>>>> driver style applications, such as high-speed networking code.
> >>>>>
> >>>>> This change introduces a framework to allow applications to elect
> >>>>> to have the stronger semantics as needed, specifying
> >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> >>>>> Subsequent commits will add additional flags and additional
> >>>>> semantics.
> >>>>
> >>>> I thought the general consensus was that this should be the default
> >>>> behavior and that any associated bugs should be fixed.
> >>>
> >>>
> >>> I think it comes down to dividing the set of use cases in two:
> >>>
> >>> - "Regular" nohz_full, as used to improve performance and limit
> >>>    interruptions, possibly for power benefits, etc.  But, stray
> >>>    interrupts are not particularly bad, and you don't want to take
> >>>    extreme measures to avoid them.
> >>>
> >>> - What I'm calling "cpu_isolated" mode where when you return to
> >>>    userspace, you expect that by God, the kernel doesn't interrupt you
> >>>    again, and if it does, it's a flat-out bug.
> >>>
> >>> There are a few things that cpu_isolated mode currently does to
> >>> accomplish its goals that are pretty heavy-weight:
> >>>
> >>> Processes are held in kernel space until ticks are quiesced; this is
> >>> not necessarily what every nohz_full task wants.  If a task makes a
> >>> kernel call, there may well be arbitrary timer fallout, and having a
> >>> way to select whether or not you are willing to take a timer tick after
> >>> return to userspace is pretty important.
> >>
> >> Then shouldn't deferred work be done immediately in nohz_full mode
> >> regardless?  What is this delayed work that's being done?
> >
> > I'm thinking of things like needing to wait for an RCU quiesce
> > period to complete.
> 
> rcu_nocbs does this, right?

CONFIG_RCU_NOCB_CPU offloads the RCU callbacks to a kthread, which
allows the nohz CPU to turn off its scheduling-clock tick more frequently.
Chris might have some other reason to wait for an RCU grace period, given
that waiting for an RCU grace period would not guarantee no callbacks.
Some more might have arrived in the meantime, and there can be some delay
between the end of the grace period and the invocation of the callbacks.
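
(Both knobs are kernel boot parameters today; e.g., to keep CPUs 1-7
tickless with their RCU callbacks offloaded -- illustrative cpu
numbers -- boot with:

	nohz_full=1-7 rcu_nocbs=1-7

on the command line.)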

> > In the current version, there's also the vmstat_update() that
> > may schedule delayed work and interrupt the core again
> > shortly before realizing that there are no more counter updates
> > happening, at which point it quiesces.  Currently we handle
> > this in cpu_isolated mode simply by spinning and waiting for
> > the timer interrupts to complete.
> 
> Perhaps we should fix that?

Didn't Christoph Lameter fix this?  Or is this an additional problem?

							Thanx, Paul

> >>> Likewise, there are things that you may want to do on return to
> >>> userspace that are designed to prevent further interruptions in
> >>> cpu_isolated mode, even at a possible future performance cost if and
> >>> when you return to the kernel, such as flushing the per-cpu free page
> >>> list so that you won't be interrupted by an IPI to flush it later.
> >>
> >> Why not just kick the per-cpu free page over to whatever cpu is
> >> monitoring your RCU state, etc?  That should be very quick.
> >
> >
> > So just for the sake of precision, the thing I'm talking about
> > is the lru_add_drain() call on kernel exit.  Are you proposing
> > that we call that for every nohz_full core on kernel exit?
> > I'm not opposed to this, but I don't know if other nohz
> > developers feel like this is the right tradeoff.
> 
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.
> 
> >> Let's fix them instead of adding new ABIs to work around them.
> >
> >
> > Well, in principle if we accepted my proposed patch series
> > and then over time came to decide that it was reasonable
> > for nohz_full to have these complete cpu isolation
> > semantics, the one proposed ABI simply becomes a no-op.
> > So it's not as problematic an ABI as some.
> 
> What if we made it a debugfs thing instead of a prctl?  Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.
> 
> >
> > My issue is this: I'm totally happy with submitting a revised
> > patch series that does all the stuff for pure nohz_full that
> > I'm currently proposing for cpu_isolated.  But, is it what
> > the community wants?  Should I propose it and see?
> >
> > Frederic, do you have any insight here?  Thanks!
> >
> > --
> > Chris Metcalf, EZChip Semiconductor
> > http://www.ezchip.com
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-21 20:36                     ` Paul E. McKenney
  (?)
@ 2015-07-22 13:57                     ` Christoph Lameter
  2015-07-22 19:28                         ` Paul E. McKenney
  -1 siblings, 1 reply; 340+ messages in thread
From: Christoph Lameter @ 2015-07-22 13:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar,
	linux-doc, Linux API, linux-kernel

On Tue, 21 Jul 2015, Paul E. McKenney wrote:

> > > In the current version, there's also the vmstat_update() that
> > > may schedule delayed work and interrupt the core again
> > > shortly before realizing that there are no more counter updates
> > > happening, at which point it quiesces.  Currently we handle
> > > this in cpu_isolated mode simply by spinning and waiting for
> > > the timer interrupts to complete.
> >
> > Perhaps we should fix that?
>
> Didn't Christoph Lameter fix this?  Or is this an additional problem?

Well, the vmstat update must first realize that there are no outstanding
updates before switching itself off. So typically there is one extra tick.
But we could add another function that will simply fold the differential
immediately and turn off the kworker task in the expectation that the
processor will stay quiet.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
@ 2015-07-22 19:28                         ` Paul E. McKenney
  0 siblings, 0 replies; 340+ messages in thread
From: Paul E. McKenney @ 2015-07-22 19:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar,
	linux-doc, Linux API, linux-kernel

On Wed, Jul 22, 2015 at 08:57:45AM -0500, Christoph Lameter wrote:
> On Tue, 21 Jul 2015, Paul E. McKenney wrote:
> 
> > > > In the current version, there's also the vmstat_update() that
> > > > may schedule delayed work and interrupt the core again
> > > > shortly before realizing that there are no more counter updates
> > > > happening, at which point it quiesces.  Currently we handle
> > > > this in cpu_isolated mode simply by spinning and waiting for
> > > > the timer interrupts to complete.
> > >
> > > Perhaps we should fix that?
> >
> > Didn't Christoph Lameter fix this?  Or is this an additional problem?
> 
> Well the vmstat update must realize first that there are no outstanding
> updates before switching itself off. So typically there is one extra tick.
> But we could add another function that will simply fold the differential
> immediately and turn the kworker task in the expectation that the
> processor will stay quiet.

Got it, thank you!

								Thanx, Paul


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-22 19:28                         ` Paul E. McKenney
  (?)
@ 2015-07-22 20:02                         ` Christoph Lameter
  2015-07-24 20:21                           ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Christoph Lameter @ 2015-07-22 20:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar,
	linux-doc, Linux API, linux-kernel

On Wed, 22 Jul 2015, Paul E. McKenney wrote:

> > > Didn't Christoph Lameter fix this?  Or is this an additional problem?
> >
> > Well the vmstat update must realize first that there are no outstanding
> > updates before switching itself off. So typically there is one extra tick.
> > But we could add another function that will simply fold the differential
> > immediately and turn the kworker task in the expectation that the
> > processor will stay quiet.
>
> Got it, thank you!
>
> 								Thanx, Paul

Ok here is a function that quiets down the vmstat kworkers.


Subject: vmstat: provide a function to quiet down the diff processing

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
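
As a usage sketch (not part of the patch above), the natural caller
would be the isolation-mode return-to-userspace path, using whatever
name that code ends up with, e.g.:

	/* sketch: cpu_isolated kernel-exit path */
	void cpu_isolation_enter(void)
	{
		lru_add_drain();	/* avoid a later drain IPI */
		quiet_vmstat();		/* fold diffs, stop vmstat work */
		/* ... then wait for pending timer irqs to quiesce ... */
	}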

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-13 19:57         ` Chris Metcalf
  (?)
  (?)
@ 2015-07-24 13:27         ` Frederic Weisbecker
  2015-07-24 20:21             ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-07-24 13:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc,
	linux-api, linux-kernel

On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode makes tradeoffs to minimize userspace
> interruptions while still attempting to avoid overheads in the
> kernel entry/exit path, to provide 100% kernel semantics, etc.
> 
> However, some applications require a stronger commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications to elect
> to have the stronger semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.
> 
> The "cpu_isolated" state is indicated by setting a new task struct
> field, cpu_isolated_flags, to the value passed by prctl().  When the
> _ENABLE bit is set for a task, and it is returning to userspace
> on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
> routine to take additional actions to help the task avoid being
> interrupted in the future.
> 
> Initially, there are only two actions taken.  First, the task
> calls lru_add_drain() to prevent being interrupted by a subsequent
> lru_add_drain_all() call on another core.  Then, the code checks for
> pending timer interrupts and quiesces until they are no longer pending.
> As a result, sys calls (and page faults, etc.) can be inordinately slow.
> However, this quiescing guarantees that no unexpected interrupts will
> occur, even if the application intentionally calls into the kernel.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/tile/kernel/process.c |  9 ++++++++
>  include/linux/sched.h      |  3 +++
>  include/linux/tick.h       | 10 ++++++++
>  include/uapi/linux/prctl.h |  5 ++++
>  kernel/context_tracking.c  |  3 +++
>  kernel/sys.c               |  8 +++++++
>  kernel/time/tick-sched.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 95 insertions(+)
> 
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..3625e839ad62 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
>  	_cpu_idle();
>  }
>  
> +#ifdef CONFIG_NO_HZ_FULL

I think this goes way beyond nohz itself. We don't only want the tick to shut down;
we also want the pending timers, workqueues, etc. to go quiet.

It's time to create the CONFIG_ISOLATION_foo stuff.

> +void tick_nohz_cpu_isolated_wait(void)
> +{
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	_cpu_idle();
> +	set_current_state(TASK_RUNNING);
> +}
> +#endif
> +
>  /*
>   * Release a thread_info structure
>   */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ae21f1591615..f350b0c20bbc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,6 +1778,9 @@ struct task_struct {
>  	unsigned long	task_state_change;
>  #endif
>  	int pagefault_disabled;
> +#ifdef CONFIG_NO_HZ_FULL
> +	unsigned int	cpu_isolated_flags;
> +#endif
>  };
>  
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 3741ba1a652c..cb5569181359 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -10,6 +10,7 @@
>  #include <linux/context_tracking_state.h>
>  #include <linux/cpumask.h>
>  #include <linux/sched.h>
> +#include <linux/prctl.h>
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>  extern void __init tick_init(void);
> @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
>  		cpumask_or(mask, mask, tick_nohz_full_mask);
>  }
>  
> +static inline bool tick_nohz_is_cpu_isolated(void)
> +{
> +	return tick_nohz_full_cpu(smp_processor_id()) &&
> +		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
>  extern void __tick_nohz_full_check(void);
>  extern void tick_nohz_full_kick(void);
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void tick_nohz_full_kick_all(void);
>  extern void __tick_nohz_task_switch(struct task_struct *tsk);
> +extern void tick_nohz_cpu_isolated_enter(void);
>  #else
>  static inline bool tick_nohz_full_enabled(void) { return false; }
>  static inline bool tick_nohz_full_cpu(int cpu) { return false; }
> @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
>  static inline void tick_nohz_full_kick(void) { }
>  static inline void tick_nohz_full_kick_all(void) { }
>  static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
> +static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
> +static inline void tick_nohz_cpu_isolated_enter(void) { }
>  #endif
>  
>  static inline bool is_housekeeping_cpu(int cpu)
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 31891d9535e2..edb40b6b84db 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -190,4 +190,9 @@ struct prctl_mm_map {
>  # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
>  # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
>  
> +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
> +#define PR_SET_CPU_ISOLATED	47
> +#define PR_GET_CPU_ISOLATED	48
> +# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 0a495ab35bc7..f9de3ee12723 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -20,6 +20,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/export.h>
>  #include <linux/kprobes.h>
> +#include <linux/tick.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
>  			 * on the tick.
>  			 */
>  			if (state == CONTEXT_USER) {
> +				if (tick_nohz_is_cpu_isolated())
> +					tick_nohz_cpu_isolated_enter();
>  				trace_user_enter(0);
>  				vtime_user_enter(current);
>  			}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 259fda25eb6b..36eb9a839f1f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_FP_MODE:
>  		error = GET_FP_MODE(me);
>  		break;
> +#ifdef CONFIG_NO_HZ_FULL
> +	case PR_SET_CPU_ISOLATED:
> +		me->cpu_isolated_flags = arg2;
> +		break;
> +	case PR_GET_CPU_ISOLATED:
> +		error = me->cpu_isolated_flags;
> +		break;
> +#endif
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c792429e98c6..4cf093c012d1 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -24,6 +24,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/perf_event.h>
>  #include <linux/context_tracking.h>
> +#include <linux/swap.h>
>  
>  #include <asm/irq_regs.h>
>  
> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
>  	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
>  		cpumask_pr_args(tick_nohz_full_mask));
>  }
> +
> +/*
> + * Rather than continuously polling for the next_event in the
> + * tick_cpu_device, architectures can provide a method to save power
> + * by sleeping until an interrupt arrives.
> + */
> +void __weak tick_nohz_cpu_isolated_wait(void)
> +{
> +	cpu_relax();
> +}
> +
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending.  Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.  Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)

Similarly, I'd rather see that in kernel/cpu_isolation.c and call it
cpu_isolation_enter().

> +{
> +	struct clock_event_device *dev =
> +		__this_cpu_read(tick_cpu_device.evtdev);
> +	struct task_struct *task = current;
> +	unsigned long start = jiffies;
> +	bool warned = false;
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
> +			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
> +				task->comm, task->pid, smp_processor_id(),
> +				(jiffies - start) / HZ);
> +			warned = true;
> +		}
> +		if (should_resched())
> +			schedule();
> +		if (test_thread_flag(TIF_SIGPENDING))
> +			break;
> +		tick_nohz_cpu_isolated_wait();

If we call cpu_idle(), what is going to wake the CPU up if no further interrupt happens?

We could either implement some sort of tick waiter with a proper wake-up once the CPU sees
no tick to schedule. Arguably this is all risky because it involves a scheduler wake-up
and thus the risk of new noise. But it might work.

Another possibility is an msleep()-based wait. But that's about the same, maybe even worse
due to repetitive wake-ups.

> +	}
> +	if (warned) {
> +		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
> +			task->comm, task->pid, smp_processor_id(),
> +			(jiffies - start) / HZ);
> +		dump_stack();
> +	}
> +}
> +
>  #endif
>  
>  /*
> -- 
> 2.1.2
> 

^ permalink raw reply	[flat|nested] 340+ messages in thread
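
(For reference, the msleep()-based alternative weighed above would be a
one-line change to the weak wait hook from the patch.  This is only a
sketch of the idea under discussion, not anything in the series; note
Chris's later point that this path runs with interrupts disabled, which
a sleeping approach would have to address:)

        #include <linux/delay.h>

        /*
         * Sketch of the alternative: sleep briefly instead of idling,
         * relying on the timer wheel rather than the next interrupt.
         * Each msleep() programs a timer, so this trades the wake-up
         * problem for repetitive timer wakeups.
         */
        void __weak tick_nohz_cpu_isolated_wait(void)
        {
                msleep(1);
        }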

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-21 19:10               ` Chris Metcalf
  2015-07-21 19:26                 ` Andy Lutomirski
@ 2015-07-24 14:03                 ` Frederic Weisbecker
  2015-07-24 20:19                   ` Chris Metcalf
  1 sibling, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-07-24 14:03 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel, Mike Galbraith

On Tue, Jul 21, 2015 at 03:10:54PM -0400, Chris Metcalf wrote:
> >>If you're arguing that the cpu_isolated semantic is really the only
> >>one that makes sense for nohz_full, my sense is that it might be
> >>surprising to many of the folks who do nohz_full work.  But, I'm happy
> >>to be wrong on this point, and maybe all the nohz_full community is
> >>interested in making the same tradeoffs for nohz_full generally that
> >>I've proposed in this patch series just for cpu_isolated?
> >nohz_full is currently dog slow for no particularly good reasons.  I
> >suspect that the interrupts you're seeing are also there for no
> >particularly good reasons.
> >
> >Let's fix them instead of adding new ABIs to work around them.
> 
> Well, in principle if we accepted my proposed patch series
> and then over time came to decide that it was reasonable
> for nohz_full to have these complete cpu isolation
> semantics, the one proposed ABI simply becomes a no-op.
> So it's not as problematic an ABI as some.
> 
> My issue is this: I'm totally happy with submitting a revised
> patch series that does all the stuff for pure nohz_full that
> I'm currently proposing for cpu_isolated.  But, is it what
> the community wants?  Should I propose it and see?
> 
> Frederic, do you have any insight here?  Thanks!

So you guys mean that if nohz_full were implemented fully, the way we
expect it to be, we shouldn't be burdened by noise at all, and that this
whole patchset would therefore be pointless, right? And that would meet
the requirements both for those who want hard isolation (a critical
noise-free guarantee) and for those who want soft isolation (as little
noise as possible, for performance).

Well, first of all, nohz is not isolation; it's a significant part of it,
but it's not all of isolation. We really want to separate these things and
not mix isolation policies into the tick code.

Second, yes, perhaps we can eventually have both the soft and hard isolation
expectations implemented the same way, through hard isolation. But that will
only work if we don't do that noise-free polling before resuming userspace:
it might be acceptable for hard isolation, which is ready to sacrifice some
warm-up before a run to meet its guarantees, but it won't work for soft
isolation workloads.

So the only solution is to offload everything we can to housekeeping
CPUs. And if we still have stuff that can't be dealt with that way,
and which needs to be taken care of with some explicit operation
before resuming userspace, then we can start to think about splitting
things into several isolation configs.

Similarly, offloading everything to housekeepers means that we sacrifice
a CPU that could have been used for performance-oriented workloads, so that
might not suit soft isolation as well. But I think we'll see all that once
we manage to have pure noise-free CPUs (some patches are on the way to be
posted by Vatika Harlalka concerning killing the residual 1Hz tick).

To summarize, let's first split nohz and isolation. Introduce
CONFIG_CPU_ISOLATION and move all the isolation policies to
kernel/cpu_isolation.c. Let's try to implement hard isolation and see if that
meets the needs of soft isolation users as well; if not, we'll split it later.

And we can keep the prctl to tell the user when hard isolation has been
broken, through SIGKILL or whatever. I think we do a similar thing
with SCHED_DEADLINE when a task hasn't met its deadline requirement. We might
want to do the same.

^ permalink raw reply	[flat|nested] 340+ messages in thread
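
(A minimal userspace sketch of the opt-in flow discussed above.  The prctl
constants come from the patch in this thread; the CPU number, and the
assumption that the kernel was booted with nohz_full=3, are illustrative
only:)

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <sys/prctl.h>

        /* From the proposed patch; not in <sys/prctl.h> upstream yet. */
        #ifndef PR_SET_CPU_ISOLATED
        #define PR_SET_CPU_ISOLATED     47
        #define PR_CPU_ISOLATED_ENABLE  (1 << 0)
        #endif

        int main(void)
        {
                cpu_set_t set;

                /* Pin ourselves to a nohz_full core (assumed: cpu 3). */
                CPU_ZERO(&set);
                CPU_SET(3, &set);
                if (sched_setaffinity(0, sizeof(set), &set)) {
                        perror("sched_setaffinity");
                        return 1;
                }

                /* Opt in; each return to userspace now quiesces first. */
                if (prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE,
                          0, 0, 0)) {
                        perror("prctl(PR_SET_CPU_ISOLATED)");
                        return 1;
                }

                /* ... noise-sensitive fast path runs here ... */
                return 0;
        }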

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-24 14:03                 ` Frederic Weisbecker
@ 2015-07-24 20:19                   ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-24 20:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, linux-doc, Linux API, linux-kernel, Mike Galbraith

On 07/24/2015 10:03 AM, Frederic Weisbecker wrote:
> To summarize, let's first split nohz and isolation. Introduce
> CONFIG_CPU_ISOLATION and move all the isolation policies to
> kernel/cpu_isolation.c. Let's try to implement hard isolation and see if that
> meets the needs of soft isolation users as well; if not, we'll split it later.

I will do that for v5.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-24 13:27         ` Frederic Weisbecker
@ 2015-07-24 20:21             ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc,
	linux-api, linux-kernel

On 07/24/2015 09:27 AM, Frederic Weisbecker wrote:
> On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote:
>> +{
>> +	struct clock_event_device *dev =
>> +		__this_cpu_read(tick_cpu_device.evtdev);
>> +	struct task_struct *task = current;
>> +	unsigned long start = jiffies;
>> +	bool warned = false;
>> +
>> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
>> +	lru_add_drain();
>> +
>> +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
>> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
>> +			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
>> +				task->comm, task->pid, smp_processor_id(),
>> +				(jiffies - start) / HZ);
>> +			warned = true;
>> +		}
>> +		if (should_resched())
>> +			schedule();
>> +		if (test_thread_flag(TIF_SIGPENDING))
>> +			break;
>> +		tick_nohz_cpu_isolated_wait();
> If we call cpu_idle(), what is going to wake the CPU up if no further interrupt happens?
>
> We could either implement some sort of tick waiter with a proper wake-up once the CPU sees
> no tick to schedule. Arguably this is all risky because it involves a scheduler wake-up
> and thus the risk of new noise. But it might work.
>
> Another possibility is an msleep()-based wait. But that's about the same, maybe even worse
> due to repetitive wake-ups.

The presumption here is that it is not possible for
tick_cpu_device to have a pending next_event without a
timer interrupt also pending to deliver it.  That certainly
seems to be true on the architectures I have looked at.
Do we think that might ever not be the case?

We are running here with interrupts disabled, so this core won't
transition from "timer interrupt scheduled" to "no timer interrupt
scheduled" before we spin or idle, and presumably no other core
can reach across and turn off our timer interrupt either.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-22 20:02                         ` Christoph Lameter
@ 2015-07-24 20:21                           ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw)
  To: Christoph Lameter, Paul E. McKenney
  Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc,
	Linux API, linux-kernel

On 07/22/2015 04:02 PM, Christoph Lameter wrote:
> On Wed, 22 Jul 2015, Paul E. McKenney wrote:
>
>>>> Didn't Christoph Lameter fix this?  Or is this an additional problem?
>>> Well the vmstat update must realize first that there are no outstanding
>>> updates before switching itself off. So typically there is one extra tick.
>>> But we could add another function that will simply fold the differential
>>> immediately and turn off the kworker task in the expectation that the
>>> processor will stay quiet.
>> Got it, thank you!
>>
>> 								Thanx, Paul
> Ok here is a function that quiets down the vmstat kworkers.

That's great - I will include this patch in my series then, and call it
as part of the "hard isolation" mode return to userspace.  Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
  2015-07-21 19:26                 ` Andy Lutomirski
  2015-07-21 20:36                     ` Paul E. McKenney
@ 2015-07-24 20:22                   ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-24 20:22 UTC (permalink / raw)
  To: Andy Lutomirski, Paul McKenney
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc,
	Linux API, linux-kernel

On 07/21/2015 03:26 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> So just for the sake of precision, the thing I'm talking about
>> is the lru_add_drain() call on kernel exit.  Are you proposing
>> that we call that for every nohz_full core on kernel exit?
>> I'm not opposed to this, but I don't know if other nohz
>> developers feel like this is the right tradeoff.
> I'm proposing either that we do that or that we arrange for other cpus
> to be able to steal our LRU list while we're in RCU user/idle.

That seems challenging; there is a lot that has to be done in
lru_add_drain() and we may not want to do it for the "soft
isolation" mode Frederic alludes to in a later email.  And, we
would have to add a bunch of locking to allow another process
to steal the list from under us, so that's not obviously going
to be a performance win in terms of the per-cpu page cache
for normal operations.

Perhaps there could be a lock taken that nohz_full processes
have to take just to exit from userspace, and that other tasks
could take to do things on behalf of the nohz_full process that
it thinks it can do locklessly.  It gets complicated, since you'd
want to tie that to whether the nohz_full process was currently
in the kernel or not, so some kind of atomic update on the
context_tracking state or some such, perhaps.  Still not really
clear if that overhead is worth it (both from a maintenance
point of view and the possible performance hit).

Limiting it just to the hard isolation mode seems like a good
answer since there we really know that userspace does not
care about the performance implications of kernel/userspace
transitions, and it doesn't cause slowdowns to anyone else.

For now I will bundle it in with my respin as part of the
"hard isolation" mode Frederic proposed.

>> Well, in principle if we accepted my proposed patch series
>> and then over time came to decide that it was reasonable
>> for nohz_full to have these complete cpu isolation
>> semantics, the one proposed ABI simply becomes a no-op.
>> So it's not as problematic an ABI as some.
> What if we made it a debugfs thing instead of a prctl?  Have a mode
> where the system tries really hard to quiesce itself even at the cost
> of performance.

No, since it's really a mode within an individual task that you'd
like to switch on and off depending on what the task is trying
to do - strict mode while it's running its main fast-path userspace
code, but certainly not strict mode during its setup, and possibly
leaving strict mode to run some kinds of slow-path, diagnostic,
or error-handling code.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread
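
(The "lock taken to exit from userspace" idea floated above might look
roughly like the following.  Nothing like this exists in the series; the
names isol_enter_kernel, isol_try_remote_drain, and isol_cpu_state, and
the three-state encoding, are invented here purely to illustrate the
atomic handoff Chris is describing:)

        #include <linux/atomic.h>
        #include <linux/percpu.h>

        /* Hypothetical per-cpu handoff state; not part of this series. */
        enum isol_state { ISOL_USER, ISOL_KERNEL, ISOL_REMOTE };

        static DEFINE_PER_CPU(atomic_t, isol_cpu_state);

        /*
         * Isolated task, on kernel entry: claim the right to touch its
         * own per-cpu state (e.g. the LRU pagevecs).
         */
        static bool isol_enter_kernel(int cpu)
        {
                return atomic_cmpxchg(per_cpu_ptr(&isol_cpu_state, cpu),
                                      ISOL_USER, ISOL_KERNEL) == ISOL_USER;
        }

        /*
         * Remote CPU: steal the drain work only while the task is in
         * userspace, so no IPI is ever needed; if this fails, the task
         * is in the kernel and will drain on its own before returning.
         */
        static bool isol_try_remote_drain(int cpu)
        {
                return atomic_cmpxchg(per_cpu_ptr(&isol_cpu_state, cpu),
                                      ISOL_USER, ISOL_REMOTE) == ISOL_USER;
        }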

* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
  2015-07-21 19:42               ` Andy Lutomirski
  (?)
@ 2015-07-24 20:29               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-24 20:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 07/21/2015 03:42 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> Second, you suggest a tracepoint.  I'm OK with creating a tracepoint
>> dedicated to cpu_isolated strict failures and making that the only
>> way this mechanism works.  But, earlier community feedback seemed to
>> suggest that the signal mechanism was OK; one piece of feedback
>> just requested being able to set which signal was delivered.  Do you
>> think the signal idea is a bad one?  Are you proposing potentially
>> having a signal and/or a tracepoint?
> I prefer the tracepoint.  It's friendlier to debuggers, and it's
> really about diagnosing a kernel problem, not a userspace problem.
> Also, I really doubt that people should deploy a signal thing in
> production.  What if an NMI fires and kills their realtime program?

No, this piece of the patch series is about diagnosing bugs in the
userspace program (likely in third-party code, in our customers'
experience).  When you violate strict mode, you get a signal and
you have a nice pointer to what instruction it was that caused
you to enter the kernel.

You are right that running this in production is likely not a great
idea, as is true for other debugging mechanisms.  But you might
really want to have it as a signal with a signal handler that fires
to generate a trace of some kind into the application's existing
tracing mechanisms, so the app doesn't just report "wow, I lost
a bunch of time in here somewhere, sorry about those packets
I dropped on the floor", but "here's where I took a strict signal".
You probably drop a few additional packets due to the signal
handling and logging, but given you've already fallen away from
100% in this case, the extra diagnostics are almost certainly
worth it.

In this case it's probably not as helpful to have a tracepoint-based
solution, just because you really do want to be able to easily
integrate into the app's existing logging framework.

My sense, I think, is that we can easily add tracepoints to the
strict failure code in the future, so it may not be worth trying to
widen the scope of the patch series just now.

>> Last, you mention systemwide configuration for monitoring.  Can you
>> expand on what you mean by that?  We already support the monitoring
>> only on the nohz_full cores, so to that extent it's already systemwide.
>> And the per-task flag has to be set by the running process when it's
>> ready for this state, so that can't really be systemwide configuration.
>> I don't understand your suggestion on this point.
> I'm really thinking about systemwide configuration for isolation.  I
> think we'll always (at least in the nearish term) need the admin's
> help to set up isolated CPUs.  If the admin makes a whole CPU be
> isolated, then monitoring just that CPU and monitoring it all the time
> seems sensible.  If we really do think that isolating a CPU should
> require a syscall of some sort because it's too expensive otherwise,
> then we can do it that way, too.  And if full isolation requires some
> user help (e.g. don't do certain things that break isolation), then
> having a per-task monitoring flag seems reasonable.
>
> We may always need the user's help to avoid IPIs.  For example, if one
> thread calls munmap, the other thread is going to get an IPI.  There's
> nothing we can do about that.

I think we're mostly agreed on this stuff, though your use of
"monitored" doesn't really match the "strict" mode in this patch.

It's certainly true that, for example, we advise customers not to
run the slow-path code on a housekeeping cpu as a thread in the
same process space as the fast-path code on the nohz_full cores,
just because things like fclose() on a file descriptor will lead to
free() which can lead to munmap() and an IPI to the fast path.

>> I'm certainly OK with rebasing on top of 4.3 after the context
>> tracking stuff is better.  That said, I think it makes sense to continue
>> to debate the intent of the patch series even if we pull this one
>> patch out and defer it until after 4.3, or having it end up pulled
>> into some other repo that includes the improvements and
>> is being pulled for 4.3.
> Sure, no problem.

I will add a comment to the patch and a note to the series about
this, but for now I'll keep it in the series.  If we can arrange to pull
it into Frederic's tree after the context_tracking changes, we can
respin it at that point to layer it on top.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread
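
(For the logging use Chris describes, a handler might look like the sketch
below.  It assumes the strict-mode signal has been configured as SIGUSR1,
since the series makes the signal choosable, and that the kernel identifies
the offending instruction via siginfo, as suggested above; the function
names are invented for illustration:)

        #include <signal.h>
        #include <string.h>
        #include <unistd.h>

        /* Only async-signal-safe calls here: write(2), not printf(3). */
        static void strict_violation(int sig, siginfo_t *info, void *uc)
        {
                static const char msg[] =
                        "strict: isolated task entered the kernel\n";

                write(STDERR_FILENO, msg, sizeof(msg) - 1);
                /*
                 * info->si_addr is assumed to point at the instruction
                 * that entered the kernel; a real handler would feed it
                 * into the app's own trace buffer instead of stderr.
                 */
        }

        /* Install before enabling strict mode via prctl(). */
        static void install_strict_handler(void)
        {
                struct sigaction sa;

                memset(&sa, 0, sizeof(sa));
                sa.sa_sigaction = strict_violation;
                sa.sa_flags = SA_SIGINFO;
                sigaction(SIGUSR1, &sa, NULL);
        }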

* [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full
@ 2015-07-28 19:49         ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

This version of the patch series incorporates Christoph Lameter's
change to add a quiet_vmstat() call, and restructures cpu_isolated as
a "hard" isolation mode in contrast to nohz_full's "soft" isolation,
breaking it out as a separate CONFIG_CPU_ISOLATED with its own
include/linux/cpu_isolated.h and kernel/time/cpu_isolated.c.
It is rebased to 4.2-rc3.

Thomas: as I mentioned in v4, I haven't heard from you whether my
removal of the cpu_idle calls sufficiently addresses your concerns
about that aspect.

Andy: as I said in email, I've left in the support where cpu_isolated
relies on the context_tracking stuff currently in 4.2-rc3.  I'm not
sure what the cleanest way is for me to pick up the new
context_tracking stuff; if that's all that ends up standing between
this patch series and having it be pulled, perhaps I can rebase it
onto whatever branch it is that has the new context_tracking?

Original patch series cover letter follows:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  The
kernel must be built with CONFIG_CPU_ISOLATED to take advantage of
this new mode.  A prctl() option (PR_SET_CPU_ISOLATED) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control.  Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on cpu_isolated cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v5:
  rebased on kernel v4.2-rc3
  converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
  incorporates Christoph Lameter's quiet_vmstat() call

v4:
  rebased on kernel v4.2-rc1
  added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also because
without that commit, cpu_isolated threads will never re-enter
userspace, since a tick will always be pending.

Chris Metcalf (5):
  cpu_isolated: add initial support
  cpu_isolated: support PR_CPU_ISOLATED_STRICT mode
  cpu_isolated: provide strict mode configurable signal
  cpu_isolated: add debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt |   7 +++
 arch/arm64/kernel/ptrace.c          |   5 ++
 arch/tile/kernel/process.c          |   9 +++
 arch/tile/kernel/ptrace.c           |   5 +-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 +++-
 include/linux/cpu_isolated.h        |  42 +++++++++++++
 include/linux/sched.h               |   3 +
 include/linux/vmstat.h              |   2 +
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |  12 +++-
 kernel/irq_work.c                   |   5 +-
 kernel/sched/core.c                 |  21 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   7 +++
 kernel/sys.c                        |   8 +++
 kernel/time/Kconfig                 |  20 +++++++
 kernel/time/Makefile                |   1 +
 kernel/time/cpu_isolated.c          | 116 ++++++++++++++++++++++++++++++++++++
 kernel/time/tick-sched.c            |   3 +-
 mm/vmstat.c                         |  14 +++++
 23 files changed, 305 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/cpu_isolated.h
 create mode 100644 kernel/time/cpu_isolated.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v5 1/6] vmstat: provide a function to quiet down the diff processing
  2015-07-28 19:49         ` Chris Metcalf
  (?)
@ 2015-07-28 19:49         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v5 2/6] cpu_isolated: add initial support
  2015-07-28 19:49         ` Chris Metcalf
@ 2015-07-28 19:49           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new CPU_ISOLATED Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "cpu_isolated" state is then
indicated by setting a new task struct field, cpu_isolated_flags,
to the value passed by prctl().  When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new cpu_isolated_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result,
syscalls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c   |  9 ++++++
 include/linux/cpu_isolated.h | 24 +++++++++++++++
 include/linux/sched.h        |  3 ++
 include/uapi/linux/prctl.h   |  5 ++++
 kernel/context_tracking.c    |  3 ++
 kernel/sys.c                 |  8 +++++
 kernel/time/Kconfig          | 20 +++++++++++++
 kernel/time/Makefile         |  1 +
 kernel/time/cpu_isolated.c   | 71 ++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 144 insertions(+)
 create mode 100644 include/linux/cpu_isolated.h
 create mode 100644 kernel/time/cpu_isolated.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..7db6f8386417 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_CPU_ISOLATED
+void cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
new file mode 100644
index 000000000000..a3d17360f7ae
--- /dev/null
+++ b/include/linux/cpu_isolated.h
@@ -0,0 +1,24 @@
+/*
+ * CPU isolation related global functions
+ */
+#ifndef _LINUX_CPU_ISOLATED_H
+#define _LINUX_CPU_ISOLATED_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_CPU_ISOLATED
+static inline bool is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
+extern void cpu_isolated_enter(void);
+extern void cpu_isolated_wait(void);
+#else
+static inline bool is_cpu_isolated(void) { return false; }
+static inline void cpu_isolated_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..0bb248385d88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_CPU_ISOLATED
+	unsigned int	cpu_isolated_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..36b6509c3e2a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/cpu_isolated.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (is_cpu_isolated())
+					cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c68417ff4800 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_CPU_ISOLATED
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 579ce1b929af..141969149994 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -195,5 +195,25 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config CPU_ISOLATED
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 cpu-isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks running hard real-time code,
+	 such as a 10 Gbit network driver in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a good-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 endmenu
 endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 49eca0beed32..984081cce974 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o tick-sched.o
 obj-$(CONFIG_TIMER_STATS)			+= timer_stats.o
 obj-$(CONFIG_DEBUG_FS)				+= timekeeping_debug.o
 obj-$(CONFIG_TEST_UDELAY)			+= test_udelay.o
+obj-$(CONFIG_CPU_ISOLATED)			+= cpu_isolated.o
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
new file mode 100644
index 000000000000..e27259f30caf
--- /dev/null
+++ b/kernel/time/cpu_isolated.c
@@ -0,0 +1,71 @@
+/*
+ *  linux/kernel/time/cpu_isolated.c
+ *
+ *  Implementation for cpu isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/cpu_isolated.h>
+#include "tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In cpu_isolated mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two cpu_isolated processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread
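
(Putting the pieces of this patch together: build with CONFIG_CPU_ISOLATED=y
on top of NO_HZ_FULL, boot with a nohz_full= cpulist, pin the task to one of
those cores, and issue the prctl.  A task can also query its own state; a
sketch using only the constants defined in the patch above:)

        #include <stdio.h>
        #include <sys/prctl.h>

        #ifndef PR_GET_CPU_ISOLATED
        #define PR_GET_CPU_ISOLATED     48
        #define PR_CPU_ISOLATED_ENABLE  (1 << 0)
        #endif

        int main(void)
        {
                /*
                 * Returns the flags word, or -1 if CONFIG_CPU_ISOLATED
                 * is not built in (prctl gives -EINVAL in that case).
                 */
                int flags = prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0);

                if (flags >= 0 && (flags & PR_CPU_ISOLATED_ENABLE))
                        printf("cpu_isolated mode enabled\n");
                else
                        printf("cpu_isolated mode not enabled\n");
                return 0;
        }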

* [PATCH v5 2/6] cpu_isolated: add initial support
@ 2015-07-28 19:49           ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new CPU_ISOLATED Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "cpu_isolated" state is then
indicated by setting a new task struct field, cpu_isolated_flags,
to the value passed by prctl().  When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new cpu_isolated_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result, sys
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c   |  9 ++++++
 include/linux/cpu_isolated.h | 24 +++++++++++++++
 include/linux/sched.h        |  3 ++
 include/uapi/linux/prctl.h   |  5 ++++
 kernel/context_tracking.c    |  3 ++
 kernel/sys.c                 |  8 +++++
 kernel/time/Kconfig          | 20 +++++++++++++
 kernel/time/Makefile         |  1 +
 kernel/time/cpu_isolated.c   | 71 ++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 144 insertions(+)
 create mode 100644 include/linux/cpu_isolated.h
 create mode 100644 kernel/time/cpu_isolated.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..7db6f8386417 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_CPU_ISOLATED
+void cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
new file mode 100644
index 000000000000..a3d17360f7ae
--- /dev/null
+++ b/include/linux/cpu_isolated.h
@@ -0,0 +1,24 @@
+/*
+ * CPU isolation related global functions
+ */
+#ifndef _LINUX_CPU_ISOLATED_H
+#define _LINUX_CPU_ISOLATED_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_CPU_ISOLATED
+static inline bool is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
+extern void cpu_isolated_enter(void);
+extern void cpu_isolated_wait(void);
+#else
+static inline bool is_cpu_isolated(void) { return false; }
+static inline void cpu_isolated_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..0bb248385d88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_CPU_ISOLATED
+	unsigned int	cpu_isolated_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..36b6509c3e2a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/cpu_isolated.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (is_cpu_isolated())
+					cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c68417ff4800 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_CPU_ISOLATED
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 579ce1b929af..141969149994 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -195,5 +195,25 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config CPU_ISOLATED
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 cpu-isolated tasks will first arrange that no future kernel
+	 activity will interrupt them while they run in userspace.
+	 This "hard" isolation from the kernel is required for tasks
+	 with hard real-time requirements, such as a userspace
+	 10 Gbit network driver.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single
+	 userspace process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 endmenu
 endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 49eca0beed32..984081cce974 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o tick-sched.o
 obj-$(CONFIG_TIMER_STATS)			+= timer_stats.o
 obj-$(CONFIG_DEBUG_FS)				+= timekeeping_debug.o
 obj-$(CONFIG_TEST_UDELAY)			+= test_udelay.o
+obj-$(CONFIG_CPU_ISOLATED)			+= cpu_isolated.o
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
new file mode 100644
index 000000000000..e27259f30caf
--- /dev/null
+++ b/kernel/time/cpu_isolated.c
@@ -0,0 +1,71 @@
+/*
+ *  linux/kernel/time/cpu_isolated.c
+ *
+ *  Implementation for cpu isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/cpu_isolated.h>
+#include "tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In cpu_isolated mode we wait until no more interrupts are
+ * pending; if any are, we nap with interrupts enabled and wait
+ * for the next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two cpu_isolated processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
-- 
2.1.2

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode
  2015-07-28 19:49         ` Chris Metcalf
@ 2015-07-28 19:49           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we exempt the prctl()
syscall so that the bit can be cleared again later, and we exempt
exit/exit_group so that a task can exit without a pointless signal
being delivered to it on the way out.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we are, in fact, exiting back to user space, so that we can properly
track user exceptions separately from other kernel entries.
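
As a usage illustration, a minimal sketch follows; the PR_* values
come from this series and are not in released kernel headers, so the
sketch defines them itself, and the polling loop is purely
illustrative:

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_CPU_ISOLATED
    #define PR_SET_CPU_ISOLATED     47      /* values from this series */
    #define PR_CPU_ISOLATED_ENABLE  (1 << 0)
    #define PR_CPU_ISOLATED_STRICT  (1 << 1)
    #endif

    int main(void)
    {
        /* Assumes this task is already affined to a nohz_full core. */
        if (prctl(PR_SET_CPU_ISOLATED,
                  PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT,
                  0, 0, 0) != 0) {
            perror("prctl(PR_SET_CPU_ISOLATED)");
            return 1;
        }

        for (;;) {
            /* Poll the device from userspace; any syscall or
             * page fault from here on is fatal (SIGKILL by
             * default). */
        }
    }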

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
Note: Andy Lutomirski points out that improvements are coming
to the context_tracking code to make it more robust, which may
mean that some of the code suggested here for context_tracking
may not be necessary.  I am keeping it in the series for now since
it is required for the series to work on a 4.2-rc3 base.

 arch/arm64/kernel/ptrace.c       |  5 +++++
 arch/tile/kernel/ptrace.c        |  5 ++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/cpu_isolated.h     | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/time/cpu_isolated.c       | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..ff83968ab4d4 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/cpu_isolated.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	/* Ensure we report cpu_isolated violations in all circumstances. */
+	if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict())
+		cpu_isolated_syscall(regs->syscallno);
+
 	/* Do the secure computing check first; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..e54256c54311 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (cpu_isolated_strict())
+			cpu_isolated_syscall(regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..e5aec57e8e25 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (cpu_isolated_strict())
+			cpu_isolated_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..590414ef2bf1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/cpu_isolated.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (cpu_isolated_strict())
+				cpu_isolated_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index a3d17360f7ae..b0f1c2669b2f 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -15,10 +15,26 @@ static inline bool is_cpu_isolated(void)
 }
 
 extern void cpu_isolated_enter(void);
+extern void cpu_isolated_syscall(int nr);
+extern void cpu_isolated_exception(void);
 extern void cpu_isolated_wait(void);
 #else
 static inline bool is_cpu_isolated(void) { return false; }
 static inline void cpu_isolated_enter(void) { }
+static inline void cpu_isolated_syscall(int nr) { }
+static inline void cpu_isolated_exception(void) { }
 #endif
 
+static inline bool cpu_isolated_strict(void)
+{
+#ifdef CONFIG_CPU_ISOLATED
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->cpu_isolated_flags &
+	     (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+	    (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_CPU_ISOLATED	47
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+# define PR_CPU_ISOLATED_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36b6509c3e2a..c740850eea11 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index e27259f30caf..d30bf3852897 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/cpu_isolated.h>
+#include <asm/unistd.h>
 #include "tick-sched.h"
 
 /*
@@ -69,3 +70,40 @@ void cpu_isolated_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_cpu_isolated_strict_task(void)
+{
+	dump_stack();
+	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void cpu_isolated_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void cpu_isolated_exception(void)
+{
+	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_cpu_isolated_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal
  2015-07-28 19:49         ` Chris Metcalf
@ 2015-07-28 19:49           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a cpu_isolated process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
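
For illustration, a fragment (assuming <sys/prctl.h>, <signal.h>,
and the PR_* definitions from this series) that asks for a catchable
SIGUSR1 rather than the default SIGKILL:

    unsigned int flags = PR_CPU_ISOLATED_ENABLE |
                         PR_CPU_ISOLATED_STRICT |
                         PR_CPU_ISOLATED_SET_SIG(SIGUSR1);

    if (prctl(PR_SET_CPU_ISOLATED, flags, 0, 0, 0) != 0)
        perror("prctl(PR_SET_CPU_ISOLATED)");

    /* A subsequent strict violation now delivers SIGUSR1, with
     * si_code set to 1 for a syscall and 0 for any other
     * synchronous kernel entry. */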

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/time/cpu_isolated.c | 17 ++++++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 0c11238a84fb..ab45bd3d5799 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
 # define PR_CPU_ISOLATED_STRICT	(1 << 1)
+# define PR_CPU_ISOLATED_SET_SIG(sig)  (((sig) & 0x7f) << 8)
+# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
index d30bf3852897..9f8fcbd97770 100644
--- a/kernel/time/cpu_isolated.c
+++ b/kernel/time/cpu_isolated.c
@@ -71,11 +71,18 @@ void cpu_isolated_enter(void)
 	}
 }
 
-static void kill_cpu_isolated_strict_task(void)
-{
+static void kill_cpu_isolated_strict_task(int is_syscall)
+{
+	siginfo_t info = {};
+	int sig;
+
 	dump_stack();
 	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -94,7 +101,7 @@ void cpu_isolated_syscall(int syscall)
 
 	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(1);
 }
 
 /*
@@ -105,5 +112,5 @@ void cpu_isolated_exception(void)
 {
 	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_cpu_isolated_strict_task();
+	kill_cpu_isolated_strict_task(0);
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v5 5/6] cpu_isolated: add debug boot flag
  2015-07-28 19:49         ` Chris Metcalf
                           ` (4 preceding siblings ...)
  (?)
@ 2015-07-28 19:49         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel
  Cc: Chris Metcalf

The new "cpu_isolated_debug" flag simplifies debugging
of CPU_ISOLATED kernels when processes are running in
PR_CPU_ISOLATED_ENABLE mode.  Such processes should receive no
interrupts from the kernel; if one arrives anyway and this boot flag
is specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a cpu_isolated
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code which remote core and context
is preparing to deliver an interrupt to a cpu_isolated core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
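
For example, a kernel booted with a command line along these lines
(the CPU list is purely illustrative) will dump a backtrace whenever
it is about to interrupt an isolated task on cores 1-3:

    nohz_full=1-3 cpu_isolated_debug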

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  7 +++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/cpu_isolated.h        |  2 ++
 kernel/irq_work.c                   |  5 ++++-
 kernel/sched/core.c                 | 21 +++++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  7 +++++++
 8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..940e4c9f1978 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -749,6 +749,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			/proc/<pid>/coredump_filter.
 			See also Documentation/filesystems/proc.txt.
 
+	cpu_isolated_debug	[KNL]
+			In kernels built with CONFIG_CPU_ISOLATED and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_CPU_ISOLATED_ENABLE
+			and is running on a nohz_full core.
+
 	cpuidle.off=1	[CPU_IDLE]
 			disable the cpuidle sub-system
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..fdef5e3d6396 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/cpu_isolated.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		cpu_isolated_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
index b0f1c2669b2f..4ea67d640be7 100644
--- a/include/linux/cpu_isolated.h
+++ b/include/linux/cpu_isolated.h
@@ -18,11 +18,13 @@ extern void cpu_isolated_enter(void);
 extern void cpu_isolated_syscall(int nr);
 extern void cpu_isolated_exception(void);
 extern void cpu_isolated_wait(void);
+extern void cpu_isolated_debug(int cpu);
 #else
 static inline bool is_cpu_isolated(void) { return false; }
 static inline void cpu_isolated_enter(void) { }
 static inline void cpu_isolated_syscall(int nr) { }
 static inline void cpu_isolated_exception(void) { }
+static inline void cpu_isolated_debug(int cpu) { }
 #endif
 
 static inline bool cpu_isolated_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..3c08a41f9898 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/cpu_isolated.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		cpu_isolated_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..647671900497 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/cpu_isolated.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_CPU_ISOLATED
+/* Enable debugging of any interrupts of cpu_isolated cores. */
+static int cpu_isolated_debug_flag;
+static int __init cpu_isolated_debug_func(char *str)
+{
+	cpu_isolated_debug_flag = true;
+	return 1;
+}
+__setup("cpu_isolated_debug", cpu_isolated_debug_func);
+
+void cpu_isolated_debug(int cpu)
+{
+	if (cpu_isolated_debug_flag && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) {
+		pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu);
+		dump_stack();
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..90ee460c2586 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_CPU_ISOLATED
+	/* If the task is being killed, don't complain about cpu_isolated. */
+	if (state & TASK_WAKEKILL)
+		t->cpu_isolated_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..846e42a3daa3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/cpu_isolated.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	cpu_isolated_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		cpu_isolated_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..456149a4a34f 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/cpu_isolated.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		cpu_isolated_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v5 6/6] nohz: cpu_isolated: allow tick to be fully disabled
  2015-07-28 19:49         ` Chris Metcalf
                           ` (5 preceding siblings ...)
  (?)
@ 2015-07-28 19:49         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel
  Cc: Chris Metcalf

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_CPU_ISOLATED) place a higher priority on running
completely tickless, so don't bound the time_delta for such
processes.  In addition, because such processes quiesce by waiting
for the timer tick to stop before returning to userspace, cpu_isolated
mode cannot be used at all without this commit.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs.  However, it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting cpu_isolated mode, which may limit how important it is
to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't forgive
such assumptions silently, so it may also be worth it from that
perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..3a1d48418499 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/cpu_isolated.h>
 
 #include <asm/irq_regs.h>
 
@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 
 #ifdef CONFIG_NO_HZ_FULL
 	/* Limit the tick delta to the maximum scheduler deferment */
-	if (!ts->inidle)
+	if (!ts->inidle && !is_cpu_isolated())
 		delta = min(delta, scheduler_tick_max_deferment());
 #endif
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v5 2/6] cpu_isolated: add initial support
@ 2015-08-12 16:00             ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-08-12 16:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel

On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device
> driver style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
> Subsequent commits will add additional flags and additional
> semantics.

We are doing this at the process level, but the isolation works at
the CPU scope... Now I wonder if prctl() is the right interface.

That said, the user is really interested in isolating a task; the CPU
is ultimately just the backend.

For example, if the task is migrated by accident, we want it to be
warned about that.  And if the isolation is done at the CPU level
instead of the task level, this won't happen.

I'm also afraid that the naming clashes with cpu_isolated_map,
although it could be a subset of it.

So probably in this case we should consider talking about task rather
than CPU isolation and change the naming accordingly (sorry, I know I
suggested cpu_isolation.c; I guess I had to see the result to realize it).

We must sort that out first.  Either we consider isolation at the task
level (and thus the underlying CPU as a backend effect) and we use prctl(),
or we do it at the CPU level and use a specific syscall or sysfs knob
that takes effect on any task on the relevant isolated CPUs.

What do you think?

It would be nice to hear others opinions as well.

> The kernel must be built with the new CPU_ISOLATED Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> nohz_full=CPULIST boot argument.  The "cpu_isolated" state is then
> indicated by setting a new task struct field, cpu_isolated_flags,
> to the value passed by prctl().  When the _ENABLE bit is set for a
> task, and it is returning to userspace on a nohz_full core, it calls
> the new cpu_isolated_enter() routine to take additional actions
> to help the task avoid being interrupted in the future.
> 
> Initially, there are only three actions taken.  First, the
> task calls lru_add_drain() to prevent being interrupted by a
> subsequent lru_add_drain_all() call on another core.  Then, it calls
> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
> interrupt.  Finally, the code checks for pending timer interrupts
> and quiesces until they are no longer pending.  As a result, sys
> calls (and page faults, etc.) can be inordinately slow.  However,
> this quiescing guarantees that no unexpected interrupts will occur,
> even if the application intentionally calls into the kernel.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/tile/kernel/process.c   |  9 ++++++
>  include/linux/cpu_isolated.h | 24 +++++++++++++++
>  include/linux/sched.h        |  3 ++
>  include/uapi/linux/prctl.h   |  5 ++++
>  kernel/context_tracking.c    |  3 ++
>  kernel/sys.c                 |  8 +++++
>  kernel/time/Kconfig          | 20 +++++++++++++
>  kernel/time/Makefile         |  1 +
>  kernel/time/cpu_isolated.c   | 71 ++++++++++++++++++++++++++++++++++++++++++++

It's not about time :-)

The timer is only a part of the isolation.

Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
better fit.

kernel/task_isolation.c maybe or just kernel/isolation.c

I think I prefer the latter because I'm not only interested in this
hard task isolation feature; I would also like to drive all the general
isolation operations from there (workqueue affinity, rcu nocb, ...).

>  9 files changed, 144 insertions(+)
>  create mode 100644 include/linux/cpu_isolated.h
>  create mode 100644 kernel/time/cpu_isolated.c
> 
> diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
> index e036c0aa9792..7db6f8386417 100644
> --- a/arch/tile/kernel/process.c
> +++ b/arch/tile/kernel/process.c
> @@ -70,6 +70,15 @@ void arch_cpu_idle(void)
>  	_cpu_idle();
>  }
>  
> +#ifdef CONFIG_CPU_ISOLATED
> +void cpu_isolated_wait(void)
> +{
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	_cpu_idle();
> +	set_current_state(TASK_RUNNING);
> +}

I'm still uncomfortable with that.  Could a wake-up model work instead?

> +#endif
> +
>  /*
>   * Release a thread_info structure
>   */
> diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
> new file mode 100644
> index 000000000000..a3d17360f7ae
> --- /dev/null
> +++ b/include/linux/cpu_isolated.h
> @@ -0,0 +1,24 @@
> +/*
> + * CPU isolation related global functions
> + */
> +#ifndef _LINUX_CPU_ISOLATED_H
> +#define _LINUX_CPU_ISOLATED_H
> +
> +#include <linux/tick.h>
> +#include <linux/prctl.h>
> +
> +#ifdef CONFIG_CPU_ISOLATED
> +static inline bool is_cpu_isolated(void)
> +{
> +	return tick_nohz_full_cpu(smp_processor_id()) &&
> +		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
> +}
> +
> +extern void cpu_isolated_enter(void);
> +extern void cpu_isolated_wait(void);
> +#else
> +static inline bool is_cpu_isolated(void) { return false; }
> +static inline void cpu_isolated_enter(void) { }
> +#endif

And all the naming should be about task as well, if we take that task direction.

> +
> +#endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04b5ada460b4..0bb248385d88 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1776,6 +1776,9 @@ struct task_struct {
>  	unsigned long	task_state_change;
>  #endif
>  	int pagefault_disabled;
> +#ifdef CONFIG_CPU_ISOLATED
> +	unsigned int	cpu_isolated_flags;
> +#endif

Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Thanks.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v5 2/6] cpu_isolated: add initial support
  2015-08-12 16:00             ` Frederic Weisbecker
@ 2015-08-12 18:22               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-12 18:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel

On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote:
>> The existing nohz_full mode is designed as a "soft" isolation mode
>> that makes tradeoffs to minimize userspace interruptions while
>> still attempting to avoid overheads in the kernel entry/exit path,
>> to provide 100% kernel semantics, etc.
>>
>> However, some applications require a "hard" commitment from the
>> kernel to avoid interruptions, in particular userspace device
>> driver style applications, such as high-speed networking code.
>>
>> This change introduces a framework to allow applications
>> to elect to have the "hard" semantics as needed, specifying
>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>> Subsequent commits will add additional flags and additional
>> semantics.
> We are doing this at the process level but the isolation works on
> the CPU scope... Now I wonder if prctl is the right interface.
>
> That said the user is rather interested in isolating a task. The CPU
> being the backend eventually.
>
> For example if the task is migrated by accident, we want it to be
> warned about that. And if the isolation is done on the CPU level
> instead of the task level, this won't happen.
>
> I'm also afraid that the naming clashes with cpu_isolated_map,
> although it could be a subset of it.
>
> So probably in this case we should consider talking about task rather
> than CPU isolation and change naming accordingly (sorry, I know I
> suggested cpu_isolation.c, I guess I had to see the result to realize).
>
> We must sort that out first. Either we consider isolation on the task
> level (and thus the underlying CPU by backend effect) and we use prctl().
> Or we do this on the CPU level and we use a specific syscall or sysfs
> which takes effect on any task in the relevant isolated CPUs.
>
> What do you think?

Yes, definitely task-centric is the right model.

With the original tilegx version of this code, we also checked that
the process had only a single core in its affinity mask, and that the
single core in question was a nohz_full core, before allowing the
"task isolated" mode to take effect.  I didn't do that in this round
of patches because it seemed a little silly in that the user could
then immediately reset their affinity to another core and lose the
effect, and it wasn't clear how to handle that: do we return EINVAL
from sched_setaffinity() after enabling the "task isolated" mode?
That seems potentially ugly, maybe standards-violating, etc.  So I
didn't bother.
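
(Roughly, the check was along the following lines; this is a sketch
against current kernel APIs with illustrative naming, not the actual
tilegx code:

    static bool task_isolation_possible(struct task_struct *p)
    {
        const struct cpumask *mask = tsk_cpus_allowed(p);

        /* Require affinity to exactly one nohz_full core. */
        return cpumask_weight(mask) == 1 &&
               tick_nohz_full_cpu(cpumask_first(mask));
    }

and the mode was refused when this returned false.)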

But you could certainly argue for failing prctl() in that case anyway,
as a way to make sure users aren't doing something stupid like calling
the prctl() from a task that's running on a housekeeping core.  And
you could even argue for doing some kind of console spew if you try to
migrate a task that is in "task isolation" state - though I suppose if
you migrate it to another isolcpus and nohz_full core, maybe that's
kind of reasonable and doesn't deserve a warning?  I'm not sure.

>> The kernel must be built with the new CPU_ISOLATED Kconfig flag
>> to enable this mode, and the kernel booted with an appropriate
>> nohz_full=CPULIST boot argument.  The "cpu_isolated" state is then
>> indicated by setting a new task struct field, cpu_isolated_flags,
>> to the value passed by prctl().  When the _ENABLE bit is set for a
>> task, and it is returning to userspace on a nohz_full core, it calls
>> the new cpu_isolated_enter() routine to take additional actions
>> to help the task avoid being interrupted in the future.
>>
>> Initially, there are only three actions taken.  First, the
>> task calls lru_add_drain() to prevent being interrupted by a
>> subsequent lru_add_drain_all() call on another core.  Then, it calls
>> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
>> interrupt.  Finally, the code checks for pending timer interrupts
>> and quiesces until they are no longer pending.  As a result, sys
>> calls (and page faults, etc.) can be inordinately slow.  However,
>> this quiescing guarantees that no unexpected interrupts will occur,
>> even if the application intentionally calls into the kernel.
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>> ---
>>   arch/tile/kernel/process.c   |  9 ++++++
>>   include/linux/cpu_isolated.h | 24 +++++++++++++++
>>   include/linux/sched.h        |  3 ++
>>   include/uapi/linux/prctl.h   |  5 ++++
>>   kernel/context_tracking.c    |  3 ++
>>   kernel/sys.c                 |  8 +++++
>>   kernel/time/Kconfig          | 20 +++++++++++++
>>   kernel/time/Makefile         |  1 +
>>   kernel/time/cpu_isolated.c   | 71 ++++++++++++++++++++++++++++++++++++++++++++
> It's not about time :-)
>
> The timer is only a part of the isolation.
>
> Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would
> better fit.
>
> kernel/task_isolation.c maybe or just kernel/isolation.c
>
> I think I prefer the latter because I'm not only interested in that task
> hard isolation feature, I would like to also drive all the general isolation
> operations from there (workqueue affinity, rcu nocb, ...).

That's reasonable, but I think the "task isolation" naming is probably
better for all the stuff that we're doing in this patch.  In other words,
we probably should use "task_isolation" as the prefix for symbol
names and API names, even if we put the code in kernel/isolation.c
for now in anticipation of non-task isolation being added later.

I think my instinct would still be to call it kernel/task_isolation.c
until we actually add some non-task isolation, and at that point we
can decide whether it makes sense to rename the file or put the new
code somewhere else.  But I'm OK with doing it the way I described
in the previous paragraph if you think that's better.

>> +#ifdef CONFIG_CPU_ISOLATED
>> +void cpu_isolated_wait(void)
>> +{
>> +	set_current_state(TASK_INTERRUPTIBLE);
>> +	_cpu_idle();
>> +	set_current_state(TASK_RUNNING);
>> +}
> I'm still uncomfortable with that. A wake up model could work?

I don't know exactly what you have in mind.  The theory is that
at this point we're ready to return to user space and we're just
waiting for a timer tick that is guaranteed to arrive, since there
is something pending for the timer.

And, this is an arch-specific method anyway; the generic method
is actually checking to see if a signal has been delivered,
scheduling is needed, etc., each time around the loop, so if
you're not sure your architecture will do the right thing, just
don't provide a method that idles while waiting.  For tilegx I'm
sure it works correctly, so I'm OK providing that method.

>> +extern void cpu_isolated_enter(void);
>> +extern void cpu_isolated_wait(void);
>> +#else
>> +static inline bool is_cpu_isolated(void) { return false; }
>> +static inline void cpu_isolated_enter(void) { }
>> +#endif
> And all the naming should be about task as well, if we take that task direction.

As discussed above, probably task_isolation_enter(), etc.

>> +
>> +#endif
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 04b5ada460b4..0bb248385d88 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1776,6 +1776,9 @@ struct task_struct {
>>   	unsigned long	task_state_change;
>>   #endif
>>   	int pagefault_disabled;
>> +#ifdef CONFIG_CPU_ISOLATED
>> +	unsigned int	cpu_isolated_flags;
>> +#endif
> Can't we add a new flag to tsk->flags? There seem to be some values remaining.

Yeah, I thought of that, but it seems like a pretty scarce resource,
and I wasn't sure it was the right thing to do.  Also, I'm not actually
sure why the lowest two bits apparently aren't being used; it looks
like PF_EXITING (0x4) is the first bit used.  And there are only three
more bits higher up in the word that are not assigned.

Also, right now we are allowing users to customize the signal delivered
for STRICT violation, and that signal value is stored in the
cpu_isolated_flags word as well, so we really don't have room in
tsk->flags for all of that anyway.
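
Concretely, the layout I have in mind is along these lines (a sketch
only; the SET_SIG/GET_SIG macros are provisional and not in the
posted series):

	# define PR_CPU_ISOLATED_ENABLE		(1 << 0)
	# define PR_CPU_ISOLATED_STRICT		(1 << 1)
	/* Provisional: stash the strict-mode signal in bits 8..14. */
	# define PR_CPU_ISOLATED_SET_SIG(sig)	(((sig) & 0x7f) << 8)
	# define PR_CPU_ISOLATED_GET_SIG(bits)	(((bits) >> 8) & 0x7f)

so the low bits are mode flags and the signal number rides in the same
word, which clearly wouldn't fit in tsk->flags.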

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v6 0/6] support "task_isolated" mode for nohz_full
  2015-07-28 19:49         ` Chris Metcalf
@ 2015-08-25 19:55           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

The cover email for the patch series is getting a little unwieldy
so I will provide a terser summary here, and just update the
list of changes from version to version.  Please see the previous
versions linked by the In-Reply-To for more detailed comments
about changes in earlier versions of the patch series.

v6:
  restructured to be a "task_isolation" mode not a "cpu_isolated"
  mode (Frederic)

v5:
  rebased on kernel v4.2-rc3
  converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
  incorporates Christoph Lameter's quiet_vmstat() call

v4:
  rebased on kernel v4.2-rc1
  added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  The
kernel must be built with CONFIG_TASK_ISOLATION to take advantage of
this new mode.  A prctl() option (PR_SET_TASK_ISOLATION) is added to
control whether processes have requested these stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control.  Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.
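
For reference, the intended usage from userspace looks something like
this (an illustrative sketch only; the cpu number is assumed to be a
core listed in nohz_full=):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <err.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_TASK_ISOLATION
	# define PR_SET_TASK_ISOLATION		47	/* from this series */
	# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#endif

	int main(void)
	{
		cpu_set_t set;
		int cpu = 1;		/* assumed: a core in nohz_full= */

		/* Pin ourselves to the isolated core first. */
		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (sched_setaffinity(0, sizeof(set), &set) != 0)
			err(1, "sched_setaffinity");

		/* Then ask the kernel for hard isolation. */
		if (prctl(PR_SET_TASK_ISOLATION,
			  PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0)
			err(1, "prctl");

		/* ... run the latency-critical loop in pure userspace ... */
		return 0;
	}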

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on task_isolation cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.2-rc3) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555).  Also, without
that commit the 1Hz tick remains enabled, so task_isolation threads
would never re-enter userspace, since a tick would always be pending.

Chris Metcalf (5):
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: provide strict mode configurable signal
  task_isolation: add debug boot flag
  nohz: task_isolation: allow tick to be fully disabled

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt |   7 +++
 arch/arm64/kernel/ptrace.c          |   5 ++
 arch/tile/kernel/process.c          |   9 +++
 arch/tile/kernel/ptrace.c           |   5 +-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 +++-
 include/linux/isolation.h           |  42 +++++++++++++
 include/linux/sched.h               |   3 +
 include/linux/vmstat.h              |   2 +
 include/uapi/linux/prctl.h          |   8 +++
 init/Kconfig                        |  20 ++++++
 kernel/Makefile                     |   1 +
 kernel/context_tracking.c           |  12 +++-
 kernel/irq_work.c                   |   5 +-
 kernel/isolation.c                  | 122 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 |  21 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   7 +++
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            |   3 +-
 mm/vmstat.c                         |  14 +++++
 23 files changed, 311 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v6 1/6] vmstat: provide a function to quiet down the diff processing
  2015-08-25 19:55           ` Chris Metcalf
@ 2015-08-25 19:55           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
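
For example, a nohz quiesce path can use it roughly like this (an
illustrative sketch; the surrounding quiesce code is not part of this
patch):

	/* About to run undisturbed in userspace on a nohz_full cpu: */
	quiet_vmstat();		/* fold diffs, cancel vmstat_work here */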

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6 2/6] task_isolation: add initial support
  2015-08-25 19:55           ` Chris Metcalf
@ 2015-08-25 19:55             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flags,
to the value passed by prctl().  When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new task_isolation_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c |  9 ++++++
 include/linux/isolation.h  | 24 +++++++++++++++
 include/linux/sched.h      |  3 ++
 include/uapi/linux/prctl.h |  5 ++++
 init/Kconfig               | 20 +++++++++++++
 kernel/Makefile            |  1 +
 kernel/context_tracking.c  |  3 ++
 kernel/isolation.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |  8 +++++
 9 files changed, 148 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..1d9bd2320a50 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..fd04011b1c1e
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,24 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+static inline bool task_isolation_enabled(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern void task_isolation_enter(void);
+extern void task_isolation_wait(void);
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..2acb618189d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..79da784fe17a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		47
+#define PR_GET_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index af09b4fb43d2..82d313cbd70f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks with hard real-time requirements,
+	 such as a 10 Gbit network driver running in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a good-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 43c4c920f30a..9ffb5c021767 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..c57c99f5c4d7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (task_isolation_enabled())
+					task_isolation_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..d4618cd9e23d
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,75 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include "time/tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ *
+ * Note that it must be guaranteed for a particular architecture
+ * that if next_event is not KTIME_MAX, then a timer interrupt will
+ * occur, otherwise the sleep may never awaken.
+ */
+void __weak task_isolation_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In task_isolation mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two task_isolation processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void task_isolation_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		task_isolation_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c7024be2d79b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		me->task_isolation_flags = arg2;
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-08-25 19:55           ` Chris Metcalf
@ 2015-08-25 19:55             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.
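
The expected usage pattern is then roughly the following (illustrative
only):

	/* Enter strict isolation: any other kernel entry is now fatal. */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT, 0, 0, 0);

	/* ... pure-userspace, latency-critical work ... */

	/* Leave strict mode again; prctl() itself is exempted. */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);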

This change adds the syscall-detection hooks only for x86, arm64,
and tile.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/ptrace.c       |  5 +++++
 arch/tile/kernel/ptrace.c        |  5 ++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..e3d83a12f3cf 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	/* Ensure we report task_isolation violations in all circumstances. */
+	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+		task_isolation_syscall(regs->syscallno);
+
 	/* Do the secure computing check first; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		47
 #define PR_GET_TASK_ISOLATION		48
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6 4/6] task_isolation: provide strict mode configurable signal
@ 2015-08-25 19:55             ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a task_isolation process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
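
For illustration, a task might request this behavior with something
like the following (a hypothetical usage sketch, not code from this
patch):

  #include <sys/prctl.h>
  #include <signal.h>

  /* Enter strict isolation; violations now deliver SIGUSR1
   * rather than the default SIGKILL. */
  prctl(PR_SET_TASK_ISOLATION,
        PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
        PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);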

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/isolation.c         | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index e16e13911e8a..2a4ddc890e22 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -195,5 +195,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION		48
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
 # define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index a89a6e9adfb4..b776aa632c8f 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -75,11 +75,20 @@ void task_isolation_enter(void)
 	}
 }
 
-static void kill_task_isolation_strict_task(void)
+static void kill_task_isolation_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	dump_stack();
 	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -98,7 +107,7 @@ void task_isolation_syscall(int syscall)
 
 	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(1);
 }
 
 /*
@@ -109,5 +118,5 @@ void task_isolation_exception(void)
 {
 	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(0);
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6 5/6] task_isolation: add debug boot flag
  2015-08-25 19:55           ` Chris Metcalf
                             ` (4 preceding siblings ...)
  (?)
@ 2015-08-25 19:55           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel; if they do, and this boot flag is
specified, a kernel stack dump is generated on the console.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code which remote core and context
are preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
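
For example (an illustrative command line, not from this patch),
booting a CONFIG_TASK_ISOLATION kernel with:

  nohz_full=1-3 task_isolation_debug

would generate a console backtrace whenever the kernel was about to
interrupt a PR_TASK_ISOLATION_ENABLE task running on cpus 1-3.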

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  7 +++++++
 arch/tile/mm/homecache.c            |  5 ++++-
 include/linux/isolation.h           |  2 ++
 kernel/irq_work.c                   |  5 ++++-
 kernel/sched/core.c                 | 21 +++++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  7 +++++++
 8 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f0459cd7b..934f172eb140 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3595,6 +3595,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_TASK_ISOLATION_ENABLE
+			and is running on a nohz_full core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..a79325113105 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		task_isolation_debug(cpu);
+	}
 }
 
 /*
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 27a4469831c1..9f1747331a36 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -18,11 +18,13 @@ extern void task_isolation_enter(void);
 extern void task_isolation_syscall(int nr);
 extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
+extern void task_isolation_debug(int cpu);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
 static inline void task_isolation_syscall(int nr) { }
 static inline void task_isolation_exception(void) { }
+static inline void task_isolation_debug(int cpu) { }
 #endif
 
 static inline bool task_isolation_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..745c2ea6a4e4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78b4bad10081..0c4e4eba69b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -745,6 +746,26 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug(int cpu)
+{
+	if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
+		pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
+		dump_stack();
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 836df8dac6cc..60e15e835b9e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		t->task_isolation_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..b0bddff2693d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	task_isolation_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		task_isolation_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..ed762fec7265 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		task_isolation_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6 6/6] nohz: task_isolation: allow tick to be fully disabled
  2015-08-25 19:55           ` Chris Metcalf
                             ` (5 preceding siblings ...)
  (?)
@ 2015-08-25 19:55           ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel
  Cc: Chris Metcalf

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes.  In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the task_isolation
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs.  However, it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting task_isolation mode, which may limit how important it is
to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't
silently forgive such assumptions, so it may also be worth it from
that perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So these semantics are very useful if we can convince ourselves
that doing this is safe.

Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..be296499b753 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 
 #ifdef CONFIG_NO_HZ_FULL
 	/* Limit the tick delta to the maximum scheduler deferment */
-	if (!ts->inidle)
+	if (!ts->inidle && !task_isolation_enabled())
 		delta = min(delta, scheduler_tick_max_deferment());
 #endif
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-08-25 19:55             ` Chris Metcalf
  (?)
@ 2015-08-26 10:36             ` Will Deacon
  2015-08-26 15:10               ` Chris Metcalf
  2015-08-28 15:31                 ` Chris Metcalf
  -1 siblings, 2 replies; 340+ messages in thread
From: Will Deacon @ 2015-08-26 10:36 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, linux-doc, linux-api,
	linux-kernel

Hi Chris,

On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
> With task_isolation mode, the task is in principle guaranteed not to
> be interrupted by the kernel, but only if it behaves.  In particular,
> if it enters the kernel via system call, page fault, or any of a
> number of other synchronous traps, it may be unexpectedly exposed
> to long latencies.  Add a simple flag that puts the process into
> a state where any such kernel entry is fatal.
> 
> To allow the state to be entered and exited, we ignore the prctl()
> syscall so that we can clear the bit again later, and we ignore
> exit/exit_group to allow exiting the task without a pointless signal
> killing you as you try to do so.
> 
> This change adds the syscall-detection hooks only for x86, arm64,
> and tile.
> 
> The signature of context_tracking_exit() changes to report whether
> we, in fact, are exiting back to user space, so that we can track
> user exceptions properly separately from other kernel entries.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/arm64/kernel/ptrace.c       |  5 +++++
>  arch/tile/kernel/ptrace.c        |  5 ++++-
>  arch/x86/kernel/ptrace.c         |  2 ++
>  include/linux/context_tracking.h | 11 ++++++++---
>  include/linux/isolation.h        | 16 ++++++++++++++++
>  include/uapi/linux/prctl.h       |  1 +
>  kernel/context_tracking.c        |  9 ++++++---
>  kernel/isolation.c               | 38 ++++++++++++++++++++++++++++++++++++++
>  8 files changed, 80 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..e3d83a12f3cf 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -37,6 +37,7 @@
>  #include <linux/regset.h>
>  #include <linux/tracehook.h>
>  #include <linux/elf.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/compat.h>
>  #include <asm/debug-monitors.h>
> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>  
>  asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>  {
> +	/* Ensure we report task_isolation violations in all circumstances. */
> +	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())

This is going to force us to check TIF_NOHZ on the syscall slowpath even
when CONFIG_TASK_ISOLATION=n.

> +		task_isolation_syscall(regs->syscallno);
> +
>  	/* Do the secure computing check first; failures should be fast. */

Here we have the usual priority problems with all the subsystems that
hook into the syscall path. If a prctl is later rewritten to a different
syscall, do you care about catching it? Either way, the comment about
doing secure computing "first" needs fixing.

Cheers,

Will

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-08-26 10:36             ` Will Deacon
@ 2015-08-26 15:10               ` Chris Metcalf
  2015-09-02 10:13                   ` Will Deacon
  2015-08-28 15:31                 ` Chris Metcalf
  1 sibling, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-08-26 15:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, linux-doc, linux-api,
	linux-kernel

On 08/26/2015 06:36 AM, Will Deacon wrote:
> Hi Chris,
>
> On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..e3d83a12f3cf 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -37,6 +37,7 @@
>>   #include <linux/regset.h>
>>   #include <linux/tracehook.h>
>>   #include <linux/elf.h>
>> +#include <linux/isolation.h>
>>   
>>   #include <asm/compat.h>
>>   #include <asm/debug-monitors.h>
>> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>   
>>   asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>>   {
>> +	/* Ensure we report task_isolation violations in all circumstances. */
>> +	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
> This is going to force us to check TIF_NOHZ on the syscall slowpath even
> when CONFIG_TASK_ISOLATION=n.

Yes, good catch.  I was thinking the "&& false" would suppress the TIF
test but I forgot that test_bit() takes a volatile argument, so it gets
evaluated even though the result isn't actually used.
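
(Roughly, the situation is:

  static inline int test_bit(int nr, const volatile unsigned long *addr);

  if (test_bit(TIF_NOHZ, &current_thread_info()->flags) && false)
          ...;  /* the volatile load of flags still happens */

since the compiler can't elide a volatile access just because the
result is discarded.)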

But I don't want to just reorder the two tests, because when isolation
is enabled, testing TIF_NOHZ first is better.  I think probably the right
solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that
test, even though that is a little crufty.  The alternative is to provide
a task_isolation_configured() macro that just returns true or false, and
make it a three-part "&&" test with that new macro first, but
that seems a little crufty as well.  Do you have a preference?
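
(For concreteness, the macro alternative would be something like
this sketch:

  static inline bool task_isolation_configured(void)
  {
          return IS_ENABLED(CONFIG_TASK_ISOLATION);
  }

  if (task_isolation_configured() &&
      test_thread_flag(TIF_NOHZ) && task_isolation_strict())
          task_isolation_syscall(regs->syscallno);

which lets the compiler discard the whole test when the config is
off.)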

>> +		task_isolation_syscall(regs->syscallno);
>> +
>>   	/* Do the secure computing check first; failures should be fast. */
> Here we have the usual priority problems with all the subsystems that
> hook into the syscall path. If a prctl is later rewritten to a different
> syscall, do you care about catching it? Either way, the comment about
> doing secure computing "first" needs fixing.

I admit I am unclear on the utility of rewriting prctl.  My instinct is that
we are trying to catch userspace invocations of prctl and allow them,
and fail most everything else, so doing it pre-rewrite seems OK.

I'm not sure if it makes sense to catch it before or after the
secure computing check, though.  On reflection maybe doing it
afterwards makes more sense - what do you think?

Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v5 2/6] cpu_isolated: add initial support
@ 2015-08-26 15:26                 ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-08-26 15:26 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel

On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+void cpu_isolated_wait(void)
> >>+{
> >>+	set_current_state(TASK_INTERRUPTIBLE);
> >>+	_cpu_idle();
> >>+	set_current_state(TASK_RUNNING);
> >>+}
> >I'm still uncomfortable with that. A wake up model could work?
> 
> I don't know exactly what you have in mind.  The theory is that
> at this point we're ready to return to user space and we're just
> waiting for a timer tick that is guaranteed to arrive, since there
> is something pending for the timer.

Hmm, ok I'm going to discuss that in the new version. One worry is that
it gets racy and we sleep there for ever.

> 
> And, this is an arch-specific method anyway; the generic method
> is actually checking to see if a signal has been delivered,
> scheduling is needed, etc., each time around the loop, so if
> you're not sure your architecture will do the right thing, just
> don't provide a method that idles while waiting.  For tilegx I'm
> sure it works correctly, so I'm OK providing that method.

Yes but we do busy waiting on all other archs then. And since we can wait
for a while there, it doesn't look sane.

> >>diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>index 04b5ada460b4..0bb248385d88 100644
> >>--- a/include/linux/sched.h
> >>+++ b/include/linux/sched.h
> >>@@ -1776,6 +1776,9 @@ struct task_struct {
> >>  	unsigned long	task_state_change;
> >>  #endif
> >>  	int pagefault_disabled;
> >>+#ifdef CONFIG_CPU_ISOLATED
> >>+	unsigned int	cpu_isolated_flags;
> >>+#endif
> >Can't we add a new flag to tsk->flags? There seem to be some values remaining.
> 
> Yeah, I thought of that, but it seems like a pretty scarce resource,
> and I wasn't sure it was the right thing to do.  Also, I'm not actually
> sure why the lowest two bits aren't apparently being used

Probably they were used but got removed.

> looks
> like PF_EXITING (0x4) is the first bit used.  And there are only three
> more bits higher up in the word that are not assigned.

Which makes room for 5 :)

> 
> Also, right now we are allowing users to customize the signal delivered
> for STRICT violation, and that signal value is stored in the
> cpu_isolated_flags word as well, so we really don't have room in
> tsk->flags for all of that anyway.

Yeah indeed, ok lets keep it that way for now.

Thanks.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v5 2/6] cpu_isolated: add initial support
  2015-08-26 15:26                 ` Frederic Weisbecker
@ 2015-08-26 15:55                   ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-26 15:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel

On 08/26/2015 11:26 AM, Frederic Weisbecker wrote:
> On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote:
>> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote:
>>>> +#ifdef CONFIG_CPU_ISOLATED
>>>> +void cpu_isolated_wait(void)
>>>> +{
>>>> +	set_current_state(TASK_INTERRUPTIBLE);
>>>> +	_cpu_idle();
>>>> +	set_current_state(TASK_RUNNING);
>>>> +}
>>> I'm still uncomfortable with that. A wake up model could work?
>> I don't know exactly what you have in mind.  The theory is that
>> at this point we're ready to return to user space and we're just
>> waiting for a timer tick that is guaranteed to arrive, since there
>> is something pending for the timer.
> Hmm, ok I'm going to discuss that in the new version. One worry is that
> it gets racy and we sleep there for ever.
>
>> And, this is an arch-specific method anyway; the generic method
>> is actually checking to see if a signal has been delivered,
>> scheduling is needed, etc., each time around the loop, so if
>> you're not sure your architecture will do the right thing, just
>> don't provide a method that idles while waiting.  For tilegx I'm
>> sure it works correctly, so I'm OK providing that method.
> Yes but we do busy waiting on all other archs then. And since we can wait
> for a while there, it doesn't look sane.

We can wait for a while (potentially multiple ticks), which is
certainly a long time, but that's what the user asked for.

Since we're checking signals and scheduling in the busy loop,
we definitely won't get into some nasty unkillable state, which
would be the real worst-case.

I think the question is, could a process just get stuck there
somehow in the normal course of events, where there is a
future event on the tick_cpu_device, but no interrupt is
enabled that will eventually deal with it?  This seems like it
would be a pretty fundamental timekeeping bug, so my
assumption here is that it can't happen, but maybe...?
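
(For reference, a rough sketch of the generic wait loop being
discussed here -- not the literal patch code:

  while (!tick_nohz_tick_stopped()) {
          if (signal_pending(current) || need_resched())
                  break;
          cpu_relax();
  }

so a pending signal or a reschedule request always breaks us out.)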

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-08-26 10:36             ` Will Deacon
@ 2015-08-28 15:31                 ` Chris Metcalf
  2015-08-28 15:31                 ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-28 15:31 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.  For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION
so we can achieve both no overhead for !TASK_ISOLATION and low
latency (testing TIF_NOHZ first) for TASK_ISOLATION.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
This "v6.1" is just a tweak to the existing v6 series to reflect
Will Deacon's suggestions about the arm64 syscall entry code.
I've updated the git tree with this updated patch in the series.
A more disruptive change would be to capture the thread flags
up front like x86 and tile, which allows the test itself to be
optimized away if the task_isolation call becomes a no-op.
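
(A sketch of that more disruptive variant, for reference -- not what
this patch does:

  unsigned long work = READ_ONCE(current_thread_info()->flags);

  if ((work & _TIF_NOHZ) && task_isolation_strict())
          task_isolation_syscall(regs->syscallno);

with the flags captured up front, the whole test can be folded away
when task_isolation_strict() compiles to constant false.)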

 arch/arm64/kernel/ptrace.c       |  6 ++++++
 arch/tile/kernel/ptrace.c        |  5 ++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..5d4284445f70 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 	if (secure_computing() == -1)
 		return -1;
 
+#ifdef CONFIG_TASK_ISOLATION
+	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+		task_isolation_syscall(regs->syscallno);
+#endif
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		47
 #define PR_GET_TASK_ISOLATION		48
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-08-28 15:31                 ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-28 15:31 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.  For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION
so we can achieve both no overhead for !TASK_ISOLATION and low
latency (testing TIF_NOHZ first) for TASK_ISOLATION.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can track
user exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
This "v6.1" is just a tweak to the existing v6 series to reflect
Will Deacon's suggestions about the arm64 syscall entry code.
I've updated the git tree with this updated patch in the series.
A more disruptive change would be to capture the thread flags
up front like x86 and tile, which allows the test itself to be
optimized away if the task_isolation call becomes a no-op.

 arch/arm64/kernel/ptrace.c       |  6 ++++++
 arch/tile/kernel/ptrace.c        |  5 ++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..5d4284445f70 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 	if (secure_computing() == -1)
 		return -1;
 
+#ifdef CONFIG_TASK_ISOLATION
+	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
+		task_isolation_syscall(regs->syscallno);
+#endif
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..2f9ce9466daf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		47
 #define PR_GET_TASK_ISOLATION		48
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..a89a6e9adfb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -73,3 +74,40 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal
  2015-08-25 19:55             ` Chris Metcalf
  (?)
@ 2015-08-28 19:22             ` Andy Lutomirski
  2015-09-02 18:38                 ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-08-28 19:22 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Aug 25, 2015 at 12:55 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> Allow userspace to override the default SIGKILL delivered
> when a task_isolation process in STRICT mode does a syscall
> or otherwise synchronously enters the kernel.
>
> In addition to being able to set the signal, we now also
> pass whether or not the interruption was from a syscall in
> the si_code field of the siginfo.
>
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  include/uapi/linux/prctl.h |  2 ++
>  kernel/isolation.c         | 17 +++++++++++++----
>  2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index e16e13911e8a..2a4ddc890e22 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -195,5 +195,7 @@ struct prctl_mm_map {
>  #define PR_GET_TASK_ISOLATION          48
>  # define PR_TASK_ISOLATION_ENABLE      (1 << 0)
>  # define PR_TASK_ISOLATION_STRICT      (1 << 1)
> +# define PR_TASK_ISOLATION_SET_SIG(sig)        (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
>
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> index a89a6e9adfb4..b776aa632c8f 100644
> --- a/kernel/isolation.c
> +++ b/kernel/isolation.c
> @@ -75,11 +75,20 @@ void task_isolation_enter(void)
>         }
>  }
>
> -static void kill_task_isolation_strict_task(void)
> +static void kill_task_isolation_strict_task(int is_syscall)
>  {
> +       siginfo_t info = {};
> +       int sig;
> +
>         dump_stack();
>         current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> -       send_sig(SIGKILL, current, 1);
> +
> +       sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
> +       if (sig == 0)
> +               sig = SIGKILL;
> +       info.si_signo = sig;
> +       info.si_code = is_syscall;
> +       send_sig_info(sig, &info, current);

The stuff you're doing here is sufficiently nasty that I think you
should add something like:

rcu_lockdep_assert(rcu_is_watching(), "some message here");

Because as it stands this is just asking for trouble.

For the record, I am *extremely* unhappy with the state of the context
tracking hooks.

--Andy
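
For reference, the interface quoted above would be driven from
userspace roughly as follows.  This is a sketch only: the SIGUSR1
choice and handler body are illustrative, and the si_code convention
is taken from the patch text rather than from a tested kernel.

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	static void isol_handler(int sig, siginfo_t *info, void *ctx)
	{
		/* Per the patch, si_code reports whether the violation
		 * came from a syscall (nonzero) or an exception. */
		const char *msg = info->si_code ?
			"isolation lost: syscall\n" :
			"isolation lost: exception\n";
		write(STDERR_FILENO, msg, strlen(msg));
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = isol_handler,
			.sa_flags = SA_SIGINFO,
		};

		/* Assumes this thread is pinned to a nohz_full core. */
		sigaction(SIGUSR1, &sa, NULL);
		prctl(PR_SET_TASK_ISOLATION,
		      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
		      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

		/* ... run the userspace-only fast path here ... */
		return 0;
	}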

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-09-02 10:13                   ` Will Deacon
  0 siblings, 0 replies; 340+ messages in thread
From: Will Deacon @ 2015-09-02 10:13 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, linux-doc, linux-api,
	linux-kernel

On Wed, Aug 26, 2015 at 04:10:34PM +0100, Chris Metcalf wrote:
> On 08/26/2015 06:36 AM, Will Deacon wrote:
> > On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote:
> >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> >> index d882b833dbdb..e3d83a12f3cf 100644
> >> --- a/arch/arm64/kernel/ptrace.c
> >> +++ b/arch/arm64/kernel/ptrace.c
> >> @@ -37,6 +37,7 @@
> >>   #include <linux/regset.h>
> >>   #include <linux/tracehook.h>
> >>   #include <linux/elf.h>
> >> +#include <linux/isolation.h>
> >>   
> >>   #include <asm/compat.h>
> >>   #include <asm/debug-monitors.h>
> >> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
> >>   
> >>   asmlinkage int syscall_trace_enter(struct pt_regs *regs)
> >>   {
> >> +	/* Ensure we report task_isolation violations in all circumstances. */
> >> +	if (test_thread_flag(TIF_NOHZ) && task_isolation_strict())
> > This is going to force us to check TIF_NOHZ on the syscall slowpath even
> > when CONFIG_TASK_ISOLATION=n.
> 
> Yes, good catch.  I was thinking the "&& false" would suppress the TIF
> test but I forgot that test_bit() takes a volatile argument, so it gets
> evaluated even though the result isn't actually used.
> 
> But I don't want to just reorder the two tests, because when isolation
> is enabled, testing TIF_NOHZ first is better.  I think probably the right
> solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that
> test, even though that is a little crufty.  The alternative is to provide
> a task_isolation_configured() macro that just returns true or false, and
> make it a three-part "&&" test with that new macro first, but
> that seems a little crufty as well.  Do you have a preference?

Maybe use IS_ENABLED(CONFIG_TASK_ISOLATION) ?

> >> +		task_isolation_syscall(regs->syscallno);
> >> +
> >>   	/* Do the secure computing check first; failures should be fast. */
> > Here we have the usual priority problems with all the subsystems that
> > hook into the syscall path. If a prctl is later rewritten to a different
> > syscall, do you care about catching it? Either way, the comment about
> > doing secure computing "first" needs fixing.
> 
> I admit I am unclear on the utility of rewriting prctl.  My instinct is that
> we are trying to catch userspace invocations of prctl and allow them,
> and fail most everything else, so doing it pre-rewrite seems OK.
> 
> I'm not sure if it makes sense to catch it before or after the
> secure computing check, though.  On reflection maybe doing it
> afterwards makes more sense - what do you think?

I don't have a strong preference (I really hate all these hooks we have
on the syscall entry/exit path), but we do need to make sure that the
behaviour is consistent across architectures.

Will
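
The volatile point above is easy to demonstrate standalone: a bitop
through a volatile pointer must still be performed even when the
result is provably dead, so only placing the constant-false term
first lets the compiler elide the load (not kernel code, just a
minimal sketch):

	extern void do_something(void);

	static inline int test_flag(const volatile unsigned long *word,
				    int bit)
	{
		return (*word >> bit) & 1;	/* volatile load */
	}

	void example(const volatile unsigned long *flags)
	{
		if (test_flag(flags, 3) && 0)	/* load still emitted */
			do_something();
		if (0 && test_flag(flags, 3))	/* whole test elided */
			do_something();
	}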

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-09-02 18:38                 ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-02 18:38 UTC (permalink / raw)
  To: Will Deacon, Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

This change updates just one patch of the patch series, so rather than
spamming out the whole series again, I've just updated this patch:

- Will Deacon suggested using IS_ENABLED(CONFIG_TASK_ISOLATION) and
  also recommended having the same ordering between SECCOMP and
  TASK_ISOLATION on all platforms, an excellent suggestion.

- Andy Lutomirski suggested using rcu_lockdep_assert(rcu_is_watching())
  to ensure RCU was properly turned back on during our syscall
  test-and-kill for strict mode.

I will update a full PATCH v7 once there seem to be no further
comments on the rest of the v6 series.

--
From: Chris Metcalf <cmetcalf@ezchip.com>
Date: Tue, 28 Jul 2015 13:25:46 -0400
Subject: [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

This change adds the syscall-detection hooks only for x86, arm64,
and tile.  We specify that it happens immediately after the
SECCOMP check, which appropriately runs first.

The signature of context_tracking_exit() changes to report whether
we, in fact, are exiting back to user space, so that we can properly
track user exceptions separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/ptrace.c       |  6 ++++++
 arch/tile/kernel/ptrace.c        |  5 ++++-
 arch/x86/kernel/ptrace.c         | 10 +++++++++-
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 41 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index d882b833dbdb..737f62db8a6f 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 	if (secure_computing() == -1)
 		return -1;
 
+	if (IS_ENABLED(CONFIG_TASK_ISOLATION) &&
+	    test_thread_flag(TIF_NOHZ) &&
+	    task_isolation_strict())
+		task_isolation_syscall(regs->syscallno);
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..c327cb918a44 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9be72bc3613f..821699513a94 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1478,7 +1478,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		user_exit();
-		work &= ~_TIF_NOHZ;
+		if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
+			work &= ~_TIF_NOHZ;
 	}
 
 #ifdef CONFIG_SECCOMP
@@ -1527,6 +1528,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	}
 #endif
 
+	/* Now check task isolation, if needed. */
+	if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
+		work &= ~_TIF_NOHZ;
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->orig_ax);
+	}
+
 	/* Do our best to finish without phase 2. */
 	if (work == 0)
 		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b96bd299966f..e0ac0228fea1 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 79da784fe17a..e16e13911e8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		47
 #define PR_GET_TASK_ISOLATION		48
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index c57c99f5c4d7..17a71f7b66b8 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index d4618cd9e23d..caa40583fe0b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -73,3 +74,43 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	/* RCU should have been enabled prior to checking the syscall. */
+	rcu_lockdep_assert(rcu_is_watching(), "syscall entry without RCU");
+
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 00/11] support "task_isolated" mode for nohz_full
  2015-08-25 19:55           ` Chris Metcalf
@ 2015-09-28 15:17             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The cover email for the patch series is getting a little unwieldy
so I will provide a terser summary here, and just update the
list of changes from version to version.  Please see the previous
versions linked by the In-Reply-To for more detailed comments
about changes in earlier versions of the patch series.

v7:
  The main change in this version is a change in where we call
  task_isolation_enter().  The arm64 code only invokes the
  context_tracking code right at kernel entry, and right at kernel
  exit, and the exit point is too late for task isolation; one of my
  test cases, when run on arm64, showed that a signal delivered while
  task isolation is waiting for the timer interrupt to quiesce was not
  properly handled before returning to userspace.  The tilegx code
  properly handled that case because it ran user_exit() in the
  work-pending loop.  But since arm64 calls user_exit() later, it was
  too late to go back and handle the signal.  I decided to make the
  task isolation work explicit in the "work" loop done on return to
  userspace, and although I could have done this by hacking up the
  arm64 assembly code for this purpose, I decided to follow the x86
  approach and use the prepare_exit_to_usermode() model where
  architectures handle work looping in C code (a sketch of that loop
  follows this change log).  I added that support
  to arm64 and tile as a pre-requisite change, then modified the loop
  in C to call task isolation appropriately.  This both makes the
  slowpath return-to-user code more maintainable for arm64 and tile
  going forward, and also avoids some of the subtlety where the
  context tracking code was being asked to invoke task isolation at
  user_enter() time.

  As a result of this change, I have moved all the
  architecture-specific changes to individual patches: two patches to
  switch arm64 and tile to the prepare_exit_to_usermode() loop, and
  three patches (one each for x86, arm64, and tile) to add the
  necessary call to task_isolation(), plus changes to check at syscall
  entry for strict mode.

  In addition, since arm64 doesn't use exception_enter(), I added an
  explicit call to task_isolation_exception() in do_mem_abort() so
  that page faults would be properly flagged in strict mode.

  I also added an RCU_LOCKDEP_WARN() at Andy Lutomirski's suggestion.

  And, the patch series is rebased to v4.3-rc1.

v6:
  restructured to be a "task_isolation" mode not a "cpu_isolated"
  mode (Frederic)

v5:
  rebased on kernel v4.2-rc3
  converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
  incorporates Christoph Lameter's quiet_vmstat() call

v4:
  rebased on kernel v4.2-rc1
  added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()
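
A sketch of the return-to-user work loop described under v7 above,
modeled loosely on the x86 approach (the signal helper and the work
mask are placeholders, not the actual patch):

	static void prepare_exit_to_usermode(struct pt_regs *regs)
	{
		while (1) {
			u32 cached_flags =
				READ_ONCE(current_thread_info()->flags);

			if (cached_flags & _TIF_NEED_RESCHED)
				schedule();
			if (cached_flags & _TIF_SIGPENDING)
				do_signal(regs);	/* placeholder */

			/*
			 * Quiesce after the other work, so a signal
			 * that arrives while waiting for the tick to
			 * stop is handled on the next pass, not lost.
			 */
			if (task_isolation_enabled())
				task_isolation_enter();

			/* Leave only if quiescing raised no new work. */
			if (!(READ_ONCE(current_thread_info()->flags) &
			      EXIT_TO_USERMODE_WORK))	/* placeholder */
				break;
		}
	}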

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  The
kernel must be built with CONFIG_TASK_ISOLATION to take advantage of
this new mode.  A prctl() option (PR_SET_TASK_ISOLATION) is added to
control whether processes have requested this stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control.  Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts (for example,
workqueues on task_isolation cores still remain to be dealt with), this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series (based currently on v4.3-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also, if we
don't remove the 1Hz tick, task_isolation threads will never re-enter
userspace, since a tick will always be pending.

Chris Metcalf (10):
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: provide strict mode configurable signal
  task_isolation: add debug boot flag
  nohz: task_isolation: allow tick to be fully disabled
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: enable task isolation functionality

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |   7 ++
 arch/arm64/include/asm/thread_info.h |  18 +++--
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  10 ++-
 arch/arm64/kernel/signal.c           |  36 +++++++---
 arch/arm64/mm/fault.c                |   8 +++
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 ++-
 arch/tile/kernel/intvec_32.S         |  46 ++++---------
 arch/tile/kernel/intvec_64.S         |  49 +++++---------
 arch/tile/kernel/process.c           |  92 ++++++++++++++-----------
 arch/tile/kernel/ptrace.c            |   3 +
 arch/tile/mm/homecache.c             |   5 +-
 arch/x86/entry/common.c              |  45 ++++++++++---
 include/linux/context_tracking.h     |  11 ++-
 include/linux/isolation.h            |  42 ++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/vmstat.h               |   2 +
 include/uapi/linux/prctl.h           |   8 +++
 init/Kconfig                         |  20 ++++++
 kernel/Makefile                      |   1 +
 kernel/context_tracking.c            |   9 ++-
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 127 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  21 ++++++
 kernel/signal.c                      |   5 ++
 kernel/smp.c                         |   4 ++
 kernel/softirq.c                     |   7 ++
 kernel/sys.c                         |   8 +++
 kernel/time/tick-sched.c             |   3 +-
 mm/vmstat.c                          |  14 ++++
 31 files changed, 477 insertions(+), 148 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v7 01/11] vmstat: provide a function to quiet down the diff processing
  2015-09-28 15:17             ` Chris Metcalf
  (?)
@ 2015-09-28 15:17             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 02/11] task_isolation: add initial support
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-09-28 15:17               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flags,
to the value passed by prctl().  When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new task_isolation_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

The task_isolation_enter() routine must be called just before the
hard return to userspace, so it is appropriately placed in the
prepare_exit_to_usermode() routine for an individual architecture
or some comparable location.  Separate patches that follow
provide these changes for x86, arm64, and tile.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 24 +++++++++++++++
 include/linux/sched.h      |  3 ++
 include/uapi/linux/prctl.h |  5 +++
 init/Kconfig               | 20 ++++++++++++
 kernel/Makefile            |  1 +
 kernel/isolation.c         | 77 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |  8 +++++
 7 files changed, 138 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..fd04011b1c1e
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,24 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+static inline bool task_isolation_enabled(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern void task_isolation_enter(void);
+extern void task_isolation_wait(void);
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9daa387c..bd2dc26948a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1800,6 +1800,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f767bf0..4ff7f052059a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks running hard real-time code,
+	 such as a 10 Gbit network driver implemented in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..6ace866c69f6
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,77 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include "time/tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ *
+ * Note that it must be guaranteed for a particular architecture
+ * that if next_event is not KTIME_MAX, then a timer interrupt will
+ * occur, otherwise the sleep may never awaken.
+ */
+void __weak task_isolation_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In task_isolation mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two task_isolation processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void task_isolation_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	if (WARN_ON(irqs_disabled()))
+		local_irq_enable();
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		cond_resched();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		task_isolation_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index fa2f2f671a5c..a2c6eb1d4ad9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2266,6 +2266,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		me->task_isolation_flags = arg2;
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread


* [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-09-28 15:17               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately after the SECCOMP test.

To allow the state to be entered and exited, we exempt the prctl()
syscall so that the bit can be cleared again later, and we exempt
exit/exit_group so that the task can exit without a pointless signal
being delivered on the way out.
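
Purely as an illustrative sketch (assuming the prctl.h additions
from this series), a task could toggle strict mode like this:

  /* Enable hard isolation; any other kernel entry is now fatal. */
  prctl(PR_SET_TASK_ISOLATION,
        PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT, 0, 0, 0);

  /* ... run fully isolated in userspace ... */

  /* prctl() itself is exempt, so strict mode can be turned off again. */
  prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);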

The signature of context_tracking_exit() changes to report whether
we are, in fact, exiting back to user space, so that user exceptions
can be tracked properly and separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/isolation.h        | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/isolation.c               | 41 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 008fc67d0d96..a840374f5d29 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
@@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (task_isolation_strict())
+				task_isolation_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index fd04011b1c1e..27a4469831c1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void)
 }
 
 extern void task_isolation_enter(void);
+extern void task_isolation_syscall(int nr);
+extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
+static inline void task_isolation_syscall(int nr) { }
+static inline void task_isolation_exception(void) { }
 #endif
 
+static inline bool task_isolation_strict(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->task_isolation_flags &
+	     (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+	    (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..2b8038b0d1e1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..ffca3c3fe64a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (!context_tracking_recursion_enter())
@@ -166,6 +167,7 @@ void context_tracking_exit(enum ctx_state state)
 			 */
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
+				from_user = true;
 				vtime_user_exit(current);
 				trace_user_exit(0);
 			}
@@ -175,6 +177,7 @@ void context_tracking_exit(enum ctx_state state)
 	context_tracking_recursion_exit();
 out_irq_restore:
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 6ace866c69f6..3779ba670472 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -10,6 +10,7 @@
 #include <linux/swap.h>
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -75,3 +76,43 @@ void task_isolation_enter(void)
 		dump_stack();
 	}
 }
+
+static void kill_task_isolation_strict_task(void)
+{
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	dump_stack();
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_task_isolation_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(void)
+{
+	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_task_isolation_strict_task();
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread


* [PATCH v7 04/11] task_isolation: provide strict mode configurable signal
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-09-28 15:17               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a task_isolation process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.

In addition to being able to set the signal, we now also
pass whether or not the interruption was from a syscall in
the si_code field of the siginfo.
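
A hedged example of the intended userspace flow (the handler and
signal choice are illustrative, and the prctl.h macros from this
patch are assumed to be available):

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/prctl.h>

  static void isol_violation(int sig, siginfo_t *info, void *uc)
  {
  	/* Per this patch, si_code is 1 for a syscall, 0 otherwise. */
  	const char *msg = info->si_code ? "violation: syscall\n"
  					: "violation: exception\n";
  	(void)sig;
  	(void)uc;
  	write(STDERR_FILENO, msg, strlen(msg));
  }

  int main(void)
  {
  	struct sigaction sa = {
  		.sa_sigaction = isol_violation,
  		.sa_flags = SA_SIGINFO,
  	};

  	sigaction(SIGUSR1, &sa, NULL);

  	/* Ask for SIGUSR1 instead of SIGKILL on a strict violation. */
  	prctl(PR_SET_TASK_ISOLATION,
  	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
  	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

  	/* ... isolated work; a stray syscall now runs the handler ... */
  	return 0;
  }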

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  2 ++
 kernel/isolation.c         | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2b8038b0d1e1..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,5 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
 # define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 3779ba670472..44bafcd08bca 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -77,14 +77,23 @@ void task_isolation_enter(void)
 	}
 }
 
-static void kill_task_isolation_strict_task(void)
+static void kill_task_isolation_strict_task(int is_syscall)
 {
+	siginfo_t info = {};
+	int sig;
+
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
 
 	dump_stack();
 	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+	info.si_signo = sig;
+	info.si_code = is_syscall;
+	send_sig_info(sig, &info, current);
 }
 
 /*
@@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall)
 
 	pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
 		current->comm, current->pid, syscall);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(1);
 }
 
 /*
@@ -114,5 +123,5 @@ void task_isolation_exception(void)
 {
 	pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
 		current->comm, current->pid);
-	kill_task_isolation_strict_task();
+	kill_task_isolation_strict_task(0);
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread


* [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 15:17             ` Chris Metcalf
                               ` (4 preceding siblings ...)
@ 2015-09-28 15:17             ` Chris Metcalf
  2015-09-28 20:59               ` Andy Lutomirski
  2015-10-05 17:07               ` Luiz Capitulino
  -1 siblings, 2 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel, and if they do, a kernel stack dump is
generated on the console when this boot flag is specified.
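
For example, a kernel might be booted with something like (the CPU
list here is illustrative):

  nohz_full=1-7 task_isolation_debug

after which the console shows a backtrace whenever the kernel is
about to interrupt an isolated task on cpus 1-7.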

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  7 +++++++
 include/linux/isolation.h           |  2 ++
 kernel/irq_work.c                   |  5 ++++-
 kernel/sched/core.c                 | 21 +++++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  7 +++++++
 7 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 22a4b687ea5b..48ff15f3166f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3623,6 +3623,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_TASK_ISOLATION_ENABLE
+			and is running on a nohz_full core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 27a4469831c1..9f1747331a36 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -18,11 +18,13 @@ extern void task_isolation_enter(void);
 extern void task_isolation_syscall(int nr);
 extern void task_isolation_exception(void);
 extern void task_isolation_wait(void);
+extern void task_isolation_debug(int cpu);
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline void task_isolation_enter(void) { }
 static inline void task_isolation_syscall(int nr) { }
 static inline void task_isolation_exception(void) { }
+static inline void task_isolation_debug(int cpu) { }
 #endif
 
 static inline bool task_isolation_strict(void)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..745c2ea6a4e4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3595403921bd..8ddabb0d7510 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -743,6 +744,26 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug(int cpu)
+{
+	if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
+		pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
+		dump_stack();
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 0f6bbbe77b46..c6e09f0f7e24 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		t->task_isolation_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..b0bddff2693d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	task_isolation_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		task_isolation_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..ed762fec7265 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		task_isolation_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled
  2015-09-28 15:17             ` Chris Metcalf
                               ` (5 preceding siblings ...)
@ 2015-09-28 15:17             ` Chris Metcalf
  2015-09-28 20:40               ` Andy Lutomirski
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

While the current fallback to a 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
running completely tickless, so don't bound the time_delta for such
processes.  In addition, due to the way such processes quiesce by
waiting for the timer tick to stop prior to returning to userspace,
without this commit it won't be possible to use the task_isolation
mode at all.

Removing the 1-second cap was previously discussed (see link
below) and Thomas Gleixner observed that vruntime, load balancing
data, load accounting, and other things might be impacted.
Frederic Weisbecker similarly observed that allowing the tick to
be indefinitely deferred just meant that no one would ever fix the
underlying bugs.  However it's at least true that the mode proposed
in this patch can only be enabled on a nohz_full core by a process
requesting task_isolation mode, which may limit how important it is
to maintain scheduler data correctly, for example.

Paul McKenney observed that if we provide a mode where the 1Hz
fallback timer is removed, this will create an environment where new
code that relies on that tick will get punished, and we won't
silently forgive such assumptions, so it may also be worthwhile from
that perspective.

Finally, it's worth observing that the tile architecture has been
using similar code for its Zero-Overhead Linux for many years
(starting in 2008) and customers are very enthusiastic about the
resulting bare-metal performance on cores that are available to
run full Linux semantics on demand (crash, logging, shutdown, etc).
So this mode is very useful if we can convince ourselves that
enabling it is safe.

Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3319e16f31e5..4504c0b95d0d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -634,7 +635,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 
 #ifdef CONFIG_NO_HZ_FULL
 	/* Limit the tick delta to the maximum scheduler deferment */
-	if (!ts->inidle)
+	if (!ts->inidle && !task_isolation_enabled())
 		delta = min(delta, scheduler_tick_max_deferment());
 #endif
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-28 15:17             ` Chris Metcalf
                               ` (6 preceding siblings ...)
@ 2015-09-28 15:17             ` Chris Metcalf
  2015-09-28 20:59               ` Andy Lutomirski
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel, H. Peter Anvin, x86
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), we would like to call
task_isolation_enter() on every return to userspace, and like
other work items, we would like to recheck for more work after
calling it, since it will enable interrupts internally.

However, if task_isolation_enter() is the only work item,
and it has already been called once, we don't want to continue
calling it in a loop.  We don't have a dedicated TIF flag for
task isolation, and it wouldn't make sense to have one, since
we'd want to set it before starting exit every time, and then
clear it the first time around the loop.

Instead, we change the loop structure somewhat, so that we
have a more inclusive set of flags that are tested for on the
first entry to the function (including TIF_NOHZ), and if any
of those flags are set, we enter the loop.  And we do the
task_isolation_enabled() test unconditionally at the bottom of the loop,
but then when making the decision to loop back, we just use the
set of flags that doesn't include TIF_NOHZ.  That way we only
loop if there is other work to do, but then if that work
is done, we again unconditionally call task_isolation_enter().

In syscall_trace_enter_phase1(), we try to add the necessary
support for strict-mode detection of syscalls in an optimized
way, by letting the code remain unchanged if we are not using
TASK_ISOLATION, but otherwise calling enter_from_user_mode()
under the first time we see _TIF_NOHZ, and then waiting until
after we do the secure computing work to actually clear the bit
from the "work" variable and call task_isolation_syscall().

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 80dcc9261ca3..0f74389c6f3b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
-		work &= ~_TIF_NOHZ;
+		if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
+			work &= ~_TIF_NOHZ;
 	}
 #endif
 
@@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	}
 #endif
 
+	/* Now check task isolation, if needed. */
+	if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
+		work &= ~_TIF_NOHZ;
+		if (task_isolation_strict())
+			task_isolation_syscall(regs->orig_ax);
+	}
+
 	/* Do our best to finish without phase 2. */
 	if (work == 0)
 		return ret;  /* seccomp and/or nohz only (ret == 0 here) */
@@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
 /* Called with IRQs disabled. */
 __visible void prepare_exit_to_usermode(struct pt_regs *regs)
 {
+	u32 cached_flags;
+
 	if (WARN_ON(!irqs_disabled()))
 		local_irq_disable();
 
 	/*
+	 * We may want to enter the loop here unconditionally to make
+	 * sure to do some work at least once.  Test here for all
+	 * possible conditions that might make us enter the loop,
+	 * and return immediately if none of them are set.
+	 */
+	cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+	if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+			      _TIF_UPROBE | _TIF_NEED_RESCHED |
+			      _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
+		user_enter();
+		return;
+	}
+
+	/*
 	 * In order to return to user mode, we need to have IRQs off with
 	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
 	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
@@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * so we need to loop.  Disabling preemption wouldn't help: doing the
 	 * work to clear some of the flags can sleep.
 	 */
-	while (true) {
-		u32 cached_flags =
-			READ_ONCE(pt_regs_to_thread_info(regs)->flags);
-
-		if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
-				      _TIF_UPROBE | _TIF_NEED_RESCHED |
-				      _TIF_USER_RETURN_NOTIFY)))
-			break;
-
+	do {
 		/* We have work to do. */
 		local_irq_enable();
 
@@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (task_isolation_enabled())
+			task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
-	}
+
+		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+
+	} while (cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+				 _TIF_UPROBE | _TIF_NEED_RESCHED |
+				 _TIF_USER_RETURN_NOTIFY));
 
 	user_enter();
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 08/11] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-09-28 15:17               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel, linux-arm-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
arm64 do_notify_resume() is called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/entry.S  |  6 +++---
 arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 4306c937b1ff..6fcbf8ea307b 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -628,9 +628,8 @@ work_pending:
 	mov	x0, sp				// 'regs'
 	tst	x2, #PSR_MODE_MASK		// user mode regs?
 	b.ne	no_work_pending			// returning to kernel
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
+	bl	prepare_exit_to_usermode
+	b	no_user_work_pending
 work_resched:
 	bl	schedule
 
@@ -642,6 +641,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+no_user_work_pending:
 	enable_step_tsk x1, x2
 no_work_pending:
 	kernel_exit 0
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..fde59c1139a9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
+					 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	do {
+		local_irq_enable();
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+		if (thread_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+
+		local_irq_disable();
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		thread_flags = READ_ONCE(current_thread_info()->flags) &
+			_TIF_WORK_MASK;
 
+	} while (thread_flags);
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread


* [PATCH v7 09/11] arch/arm64: enable task isolation functionality
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-09-28 15:17               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel, linux-arm-kernel
  Cc: Chris Metcalf

We need to call task_isolation_enter() from prepare_exit_to_usermode(),
both so that we can ensure it is done last before returning to
userspace, and so that we can re-run signal handling, etc., if
something occurs while task_isolation_enter() has interrupts
enabled.  To do this we add _TIF_NOHZ to the _TIF_WORK_MASK if
we have CONFIG_TASK_ISOLATION enabled, which brings us into
prepare_exit_to_usermode() on every return to userspace.  But we
don't put _TIF_NOHZ in the flags that we use to loop back and
recheck, since the flag being set is not by itself a reason to
loop back.  Instead we unconditionally call task_isolation_enter()
at the end of the loop if any other work was done.

To make the assembly code continue to be as optimized as before,
we renumber the _TIF flags so that both _TIF_WORK_MASK and
_TIF_SYSCALL_WORK still have contiguous runs of bits in the
immediate operand for the "and" instruction, as required by the
ARM64 ISA.  Since TIF_NOHZ is in both masks, it must be the
middle bit in the contiguous run that starts with the
_TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits.
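
(As an aside, the contiguity requirement could be checked at build
time with something along these lines -- an illustrative sketch, not
part of this patch; MASK_IS_CONTIGUOUS and check_tif_masks() are
hypothetical names:)

	/*
	 * A nonzero mask is a single contiguous run of bits iff adding
	 * its lowest set bit to it clears the entire run.
	 */
	#define MASK_IS_CONTIGUOUS(m) \
		((m) != 0 && ((((m) + ((m) & -(m))) & (m)) == 0))

	static void __init check_tif_masks(void)
	{
		BUILD_BUG_ON(!MASK_IS_CONTIGUOUS(_TIF_WORK_MASK));
		BUILD_BUG_ON(!MASK_IS_CONTIGUOUS(_TIF_SYSCALL_WORK));
	}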

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.
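
(The pattern, in miniature, reusing the names from the diff below:)

	/* Before: each test does its own volatile read of thread_info->flags. */
	if (test_thread_flag(TIF_SYSCALL_TRACE))
		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

	/* After: one read up front, then cheap bit tests on the local copy. */
	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
	if (work & _TIF_SYSCALL_TRACE)
		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);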

Also, we have to add an explicit check for STRICT mode in
do_mem_abort() to handle the case of page faults, since arm64
does not use the exception_enter() mechanism.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------
 arch/arm64/kernel/ptrace.c           | 10 ++++++++--
 arch/arm64/kernel/signal.c           |  6 +++++-
 arch/arm64/mm/fault.c                |  8 ++++++++
 4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d18a42a..4c36c4ee3528 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -101,11 +101,11 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
-#define TIF_NOHZ		7
-#define TIF_SYSCALL_TRACE	8
-#define TIF_SYSCALL_AUDIT	9
-#define TIF_SYSCALL_TRACEPOINT	10
-#define TIF_SECCOMP		11
+#define TIF_NOHZ		4
+#define TIF_SYSCALL_TRACE	5
+#define TIF_SYSCALL_AUDIT	6
+#define TIF_SYSCALL_TRACEPOINT	7
+#define TIF_SECCOMP		8
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -124,9 +124,15 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
-#define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
 
+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK		(_TIF_WORK_LOOP_MASK | _TIF_NOHZ)
+#else
+# define _TIF_WORK_MASK		_TIF_WORK_LOOP_MASK
+#endif
+
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
 				 _TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1971f491bb90..9113789e9486 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
 	/* Do the secure computing check first; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if ((work & _TIF_NOHZ) && task_isolation_strict())
+		task_isolation_syscall(regs->syscallno);
+
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index fde59c1139a9..def9166eac9e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -419,10 +420,13 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
 		if (thread_flags & _TIF_FOREIGN_FPSTATE)
 			fpsimd_restore_current_state();
 
+		if (task_isolation_enabled())
+			task_isolation_enter();
+
 		local_irq_disable();
 
 		thread_flags = READ_ONCE(current_thread_info()->flags) &
-			_TIF_WORK_MASK;
+			_TIF_WORK_LOOP_MASK;
 
 	} while (thread_flags);
 }
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index aba9ead1384c..01c9ae336887 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -465,6 +466,13 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
 	const struct fault_info *inf = fault_info + (esr & 63);
 	struct siginfo info;
 
+	/* We don't use exception_enter(), so we check strict isolation here. */
+	if (IS_ENABLED(CONFIG_TASK_ISOLATION) &&
+	    test_thread_flag(TIF_NOHZ) &&
+	    task_isolation_strict() &&
+	    user_mode(regs))
+		task_isolation_exception();
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 10/11] arch/tile: adopt prepare_exit_to_usermode() model from x86
  2015-09-28 15:17             ` Chris Metcalf
                               ` (9 preceding siblings ...)
  (?)
@ 2015-09-28 15:17             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
tile do_work_pending() was called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

This change exposes a pre-existing bug on the older tilepro platform;
the singlestep processing is done last, but on tilepro (unlike tilegx)
we enable interrupts while doing that processing, so we could in
theory miss a signal or other asynchronous event.  A future change
could fix this by breaking the singlestep work into a "prepare"
step done in the main loop, and a "trigger" step done after exiting
the loop.  Since this change is intended as purely a restructuring
change, we call out the bug explicitly now, but don't yet fix it.
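
(A rough sketch of that future fix, using hypothetical
single_step_prepare()/single_step_trigger() helpers:)

	do {
		local_irq_enable();

		if (thread_info_flags & _TIF_SINGLESTEP)
			single_step_prepare(regs);	/* just record intent */

		/* ... handle resched, signals, notify-resume as today ... */

		local_irq_disable();
		thread_info_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_info_flags & _TIF_WORK_MASK);

	/* Arm the single-step machinery with interrupts still disabled. */
	if (thread_info_flags & _TIF_SINGLESTEP)
		single_step_trigger(regs);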

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/include/asm/processor.h   |  2 +-
 arch/tile/include/asm/thread_info.h |  8 +++-
 arch/tile/kernel/intvec_32.S        | 46 +++++++--------------
 arch/tile/kernel/intvec_64.S        | 49 +++++++----------------
 arch/tile/kernel/process.c          | 79 +++++++++++++++++++------------------
 5 files changed, 77 insertions(+), 107 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index 139dfdee0134..0684e88aacd8 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task)
 	/* Nothing for now */
 }
 
-extern int do_work_pending(struct pt_regs *regs, u32 flags);
+extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags);
 
 
 /*
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index dc1fb28d9636..4b7cef9e94e0 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -140,10 +140,14 @@ extern void _cpu_idle(void);
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
 
+/* Work to do as we loop to exit to user space. */
+#define _TIF_WORK_MASK \
+	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
-	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ)
+	(_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ)
 
 /* Work to do at syscall entry. */
 #define _TIF_SYSCALL_ENTRY_WORK \
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index fbbe2ea882ea..33d48812872a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 * Get base of stack in r32.
-	 */
-	{
-	 GET_THREAD_INFO(r32)
-	 movei  r33, 0
-	}
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	GET_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, lo16(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, lo16(_TIF_ALLWORK_MASK)
 	}
 	{
-	 lw     r29, r29
-	 auli   r1, r1, ha16(_TIF_ALLWORK_MASK)
+	 lw     r22, r22
+	 auli   r20, r20, ha16(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	bzt     r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling, notify-resume, or single-step.  Call out to C
-	 * code to figure out exactly what we need to do for each flag bit,
-	 * then if necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnz    r33, 1f
+	 bzt    r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnz     r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 58964d209d4d..a41c994ce237 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 */
-	{
-	 movei  r33, 0
-	 move   r32, sp
-	}
-
-	/* Get base of stack in r32. */
-	EXTRACT_THREAD_INFO(r32)
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	move    r21, sp
+	EXTRACT_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, hw1_last(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, hw1_last(_TIF_ALLWORK_MASK)
 	}
 	{
-	 ld     r29, r29
-	 shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK)
+	 ld     r22, r22
+	 shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	beqzt   r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling or notify-resume.  Call out to C code to figure out
-	 * exactly what we need to do for each flag bit, then if
-	 * necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnez   r33, 1f
+	 beqzt  r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnez    r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 7d5769310bef..b5f30d376ce1 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
 
 /*
  * This routine is called on return from interrupt if any of the
- * TIF_WORK_MASK flags are set in thread_info->flags.  It is
- * entered with interrupts disabled so we don't miss an event
- * that modified the thread_info flags.  If any flag is set, we
- * handle it and return, and the calling assembly code will
- * re-disable interrupts, reload the thread flags, and call back
- * if more flags need to be handled.
- *
- * We return whether we need to check the thread_info flags again
- * or not.  Note that we don't clear TIF_SINGLESTEP here, so it's
- * important that it be tested last, and then claim that we don't
- * need to recheck the flags.
+ * TIF_ALLWORK_MASK flags are set in thread_info->flags.  It is
+ * entered with interrupts disabled so we don't miss an event that
+ * modified the thread_info flags.  We loop until all the tested flags
+ * are clear.  Note that the function is called on certain conditions
+ * that are not listed in the loop condition here (e.g. SINGLESTEP)
+ * which guarantees we will do those things once, and redo them if any
+ * of the other work items is re-done, but won't continue looping if
+ * all the other work is done.
  */
-int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
+void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 {
-	/* If we enter in kernel mode, do nothing and exit the caller loop. */
-	if (!user_mode(regs))
-		return 0;
+	if (WARN_ON(!user_mode(regs)))
+		return;
 
-	user_exit();
+	do {
+		local_irq_enable();
 
-	/* Enable interrupts; they are disabled again on return to caller. */
-	local_irq_enable();
+		if (thread_info_flags & _TIF_NEED_RESCHED)
+			schedule();
 
-	if (thread_info_flags & _TIF_NEED_RESCHED) {
-		schedule();
-		return 1;
-	}
 #if CHIP_HAS_TILE_DMA()
-	if (thread_info_flags & _TIF_ASYNC_TLB) {
-		do_async_page_fault(regs);
-		return 1;
-	}
+		if (thread_info_flags & _TIF_ASYNC_TLB)
+			do_async_page_fault(regs);
 #endif
-	if (thread_info_flags & _TIF_SIGPENDING) {
-		do_signal(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_SINGLESTEP)
+
+		if (thread_info_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		local_irq_disable();
+		thread_info_flags = READ_ONCE(current_thread_info()->flags);
+
+	} while (thread_info_flags & _TIF_WORK_MASK);
+
+	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
+#ifndef __tilegx__
+		/*
+		 * FIXME: on tilepro, since we enable interrupts in
+		 * this routine, it's possible that we miss a signal
+		 * or other asynchronous event.
+		 */
+		local_irq_disable();
+#endif
+	}
 
 	user_enter();
-
-	return 0;
 }
 
 unsigned long get_wchan(struct task_struct *p)
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v7 11/11] arch/tile: enable task isolation functionality
  2015-09-28 15:17             ` Chris Metcalf
                               ` (10 preceding siblings ...)
  (?)
@ 2015-09-28 15:17             ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

In addition, we provide an architecture override of
task_isolation_wait() that runs a nap instruction while waiting for
an interrupt, so that the task_isolation_enter() loop runs in a
lower-power state.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c | 13 +++++++++++++
 arch/tile/kernel/ptrace.c  |  3 +++
 arch/tile/mm/homecache.c   |  5 ++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..28aa0f8b45ef 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -70,6 +71,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
@@ -495,6 +505,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		if (task_isolation_enabled())
+			task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index bdc126faf741..04a7a6bf7d0a 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -265,6 +265,9 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	if (secure_computing() == -1)
 		return -1;
 
+	if ((work & _TIF_NOHZ) && task_isolation_strict())
+		task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]);
+
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
 			regs->regs[TREG_SYSCALL_NR] = -1;
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..a79325113105 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		task_isolation_debug(cpu);
+	}
 }
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled
  2015-09-28 15:17             ` [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled Chris Metcalf
@ 2015-09-28 20:40               ` Andy Lutomirski
  2015-10-01 13:07                 ` Frederic Weisbecker
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 20:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> While the current fallback to 1-second tick is still helpful for
> maintaining completely correct kernel semantics, processes using
> prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
> running completely tickless, so don't bound the time_delta for such
> processes.  In addition, due to the way such processes quiesce by
> waiting for the timer tick to stop prior to returning to userspace,
> without this commit it won't be possible to use the task_isolation
> mode at all.
>
> Removing the 1-second cap was previously discussed (see link
> below) and Thomas Gleixner observed that vruntime, load balancing
> data, load accounting, and other things might be impacted.
> Frederic Weisbecker similarly observed that allowing the tick to
> be indefinitely deferred just meant that no one would ever fix the
> underlying bugs.  However it's at least true that the mode proposed
> in this patch can only be enabled on a nohz_full core by a process
> requesting task_isolation mode, which may limit how important it is
> to maintain scheduler data correctly, for example.

What goes wrong when a task enables this?  Presumably either tasks
that enable it experience problems or performance issues, or else it
should always be enabled.

One possible issue: __vdso_clock_gettime with any of the COARSE clocks
as well as __vdso_time will break if the timekeeping code doesn't run
somewhere with reasonable frequency on some core.  Hopefully this
always works.
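
(For instance, a hypothetical caller like this one is served entirely
from the vDSO data page, so it only ever sees values as fresh as the
last timekeeping update:)

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		struct timespec ts;

		/* No syscall, no hardware clock read: just the cached value. */
		clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
		printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
		return 0;
	}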

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-09-28 20:51                 ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 20:51 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> With task_isolation mode, the task is in principle guaranteed not to
> be interrupted by the kernel, but only if it behaves.  In particular,
> if it enters the kernel via system call, page fault, or any of a
> number of other synchronous traps, it may be unexpectedly exposed
> to long latencies.  Add a simple flag that puts the process into
> a state where any such kernel entry is fatal; this is defined as
> happening immediately after the SECCOMP test.

Why after seccomp?  Seccomp is still an entry, and the code would be
considerably simpler if it were before seccomp.

> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>                 return 0;
>
>         prev_ctx = this_cpu_read(context_tracking.state);
> -       if (prev_ctx != CONTEXT_KERNEL)
> -               context_tracking_exit(prev_ctx);
> +       if (prev_ctx != CONTEXT_KERNEL) {
> +               if (context_tracking_exit(prev_ctx)) {
> +                       if (task_isolation_strict())
> +                               task_isolation_exception();
> +               }
> +       }
>
>         return prev_ctx;
>  }

x86 does not promise to call this function.  In fact, x86 is rather
likely to stop ever calling this function in the reasonably near
future.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>   * This call supports re-entrancy. This way it can be called from any exception
>   * handler without needing to know if we came from userspace or not.
>   */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)

This needs clear documentation of what the return value means.

> +static void kill_task_isolation_strict_task(void)
> +{
> +       /* RCU should have been enabled prior to this point. */
> +       RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
> +
> +       dump_stack();
> +       current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> +       send_sig(SIGKILL, current, 1);
> +}

Wasn't this supposed to be configurable?  Or is that something that
happens later on in the series?

> +
> +/*
> + * This routine is called from syscall entry (with the syscall number
> + * passed in) if the _STRICT flag is set.
> + */
> +void task_isolation_syscall(int syscall)
> +{
> +       /* Ignore prctl() syscalls or any task exit. */
> +       switch (syscall) {
> +       case __NR_prctl:
> +       case __NR_exit:
> +       case __NR_exit_group:
> +               return;
> +       }
> +
> +       pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
> +               current->comm, current->pid, syscall);
> +       kill_task_isolation_strict_task();
> +}

Ick.  I guess it works, but this is still quite ugly IMO.

> +void task_isolation_exception(void)
> +{
> +       pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
> +               current->comm, current->pid);
> +       kill_task_isolation_strict_task();
> +}

Should this say what exception?

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal
  2015-09-28 15:17               ` Chris Metcalf
  (?)
@ 2015-09-28 20:54               ` Andy Lutomirski
  2015-09-28 21:54                   ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 20:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> Allow userspace to override the default SIGKILL delivered
> when a task_isolation process in STRICT mode does a syscall
> or otherwise synchronously enters the kernel.
>
> In addition to being able to set the signal, we now also
> pass whether or not the interruption was from a syscall in
> the si_code field of the siginfo.
>
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  include/uapi/linux/prctl.h |  2 ++
>  kernel/isolation.c         | 17 +++++++++++++----
>  2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 2b8038b0d1e1..a5582ace987f 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -202,5 +202,7 @@ struct prctl_mm_map {
>  #define PR_GET_TASK_ISOLATION          49
>  # define PR_TASK_ISOLATION_ENABLE      (1 << 0)
>  # define PR_TASK_ISOLATION_STRICT      (1 << 1)
> +# define PR_TASK_ISOLATION_SET_SIG(sig)        (((sig) & 0x7f) << 8)
> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
>
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> index 3779ba670472..44bafcd08bca 100644
> --- a/kernel/isolation.c
> +++ b/kernel/isolation.c
> @@ -77,14 +77,23 @@ void task_isolation_enter(void)
>         }
>  }
>
> -static void kill_task_isolation_strict_task(void)
> +static void kill_task_isolation_strict_task(int is_syscall)
>  {
> +       siginfo_t info = {};
> +       int sig;
> +
>         /* RCU should have been enabled prior to this point. */
>         RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
>
>         dump_stack();
>         current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
> -       send_sig(SIGKILL, current, 1);
> +
> +       sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
> +       if (sig == 0)
> +               sig = SIGKILL;
> +       info.si_signo = sig;
> +       info.si_code = is_syscall;

I think this needs real SI_ defines.

> +       send_sig_info(sig, &info, current);
>  }
>
>  /*
> @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall)
>
>         pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
>                 current->comm, current->pid, syscall);
> -       kill_task_isolation_strict_task();
> +       kill_task_isolation_strict_task(1);

No magic numbers please.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-28 15:17             ` [PATCH v7 07/11] arch/x86: enable task isolation functionality Chris Metcalf
@ 2015-09-28 20:59               ` Andy Lutomirski
  2015-09-28 21:57                 ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 20:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> In prepare_exit_to_usermode(), we would like to call
> task_isolation_enter() on every return to userspace, and like
> other work items, we would like to recheck for more work after
> calling it, since it will enable interrupts internally.
>
> However, if task_isolation_enter() is the only work item,
> and it has already been called once, we don't want to continue
> calling it in a loop.  We don't have a dedicated TIF flag for
> task isolation, and it wouldn't make sense to have one, since
> we'd want to set it before starting exit every time, and then
> clear it the first time around the loop.
>
> Instead, we change the loop structure somewhat, so that we
> have a more inclusive set of flags that are tested for on the
> first entry to the function (including TIF_NOHZ), and if any
> of those flags are set, we enter the loop.  And, we do the
> task_isolation_enabled() test unconditionally at the bottom of the loop,
> but then when making the decision to loop back, we just use the
> set of flags that doesn't include TIF_NOHZ.  That way we only
> loop if there is other work to do, but then if that work
> is done, we again unconditionally call task_isolation_enter().
>
> In syscall_trace_enter_phase1(), we try to add the necessary
> support for strict-mode detection of syscalls in an optimized
> way, by letting the code remain unchanged if we are not using
> TASK_ISOLATION, but otherwise calling enter_from_user_mode()
> when we first see _TIF_NOHZ, and then waiting until
> after we do the secure computing work to actually clear the bit
> from the "work" variable and call task_isolation_syscall().
>
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 80dcc9261ca3..0f74389c6f3b 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -21,6 +21,7 @@
>  #include <linux/context_tracking.h>
>  #include <linux/user-return-notifier.h>
>  #include <linux/uprobes.h>
> +#include <linux/isolation.h>
>
>  #include <asm/desc.h>
>  #include <asm/traps.h>
> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>          */
>         if (work & _TIF_NOHZ) {
>                 enter_from_user_mode();
> -               work &= ~_TIF_NOHZ;
> +               if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
> +                       work &= ~_TIF_NOHZ;
>         }
>  #endif
>
> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>         }
>  #endif
>
> +       /* Now check task isolation, if needed. */
> +       if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
> +               work &= ~_TIF_NOHZ;
> +               if (task_isolation_strict())
> +                       task_isolation_syscall(regs->orig_ax);
> +       }
> +

This is IMO rather nasty.  Can you try to find a way to do this
without making the control flow depend on config options?

What guarantees that TIF_NOHZ is an acceptable thing to check?
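
(E.g., a small helper could keep the call site uniform regardless of
the config -- sketch only, with a hypothetical name:)

	static inline bool task_isolation_check_syscall(unsigned long work)
	{
		/* Compiles away entirely when TASK_ISOLATION is off. */
		return IS_ENABLED(CONFIG_TASK_ISOLATION) &&
			(work & _TIF_NOHZ) && task_isolation_strict();
	}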

>         /* Do our best to finish without phase 2. */
>         if (work == 0)
>                 return ret;  /* seccomp and/or nohz only (ret == 0 here) */
> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
>  /* Called with IRQs disabled. */
>  __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>  {
> +       u32 cached_flags;
> +
>         if (WARN_ON(!irqs_disabled()))
>                 local_irq_disable();
>
>         /*
> +        * We may want to enter the loop here unconditionally to make
> +        * sure to do some work at least once.  Test here for all
> +        * possible conditions that might make us enter the loop,
> +        * and return immediately if none of them are set.
> +        */
> +       cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> +       if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> +                             _TIF_UPROBE | _TIF_NEED_RESCHED |
> +                             _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
> +               user_enter();
> +               return;
> +       }
> +

Too complicated and too error prone.

In any event, I don't think that the property you actually want is for
the loop to be entered once.  I think the property you want is that
we're isolated by the time we're finished.  Why not just check that
directly in the loop condition?
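
(I.e., something like this sketch, where EXIT_WORK_FLAGS,
handle_exit_work() and task_isolation_done() are hypothetical names:)

	while (true) {
		u32 flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
		bool work = flags & EXIT_WORK_FLAGS;

		if (!work && (!task_isolation_enabled() ||
			      task_isolation_done()))
			break;

		local_irq_enable();
		if (work)
			handle_exit_work(regs, flags);	/* signals, resched, ... */
		if (task_isolation_enabled())
			task_isolation_enter();
		local_irq_disable();
	}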

> +       /*
>          * In order to return to user mode, we need to have IRQs off with
>          * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
>          * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
> @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>          * so we need to loop.  Disabling preemption wouldn't help: doing the
>          * work to clear some of the flags can sleep.
>          */
> -       while (true) {
> -               u32 cached_flags =
> -                       READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> -
> -               if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> -                                     _TIF_UPROBE | _TIF_NEED_RESCHED |
> -                                     _TIF_USER_RETURN_NOTIFY)))
> -                       break;
> -
> +       do {
>                 /* We have work to do. */
>                 local_irq_enable();
>
> @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>                 if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>                         fire_user_return_notifiers();
>
> +               if (task_isolation_enabled())
> +                       task_isolation_enter();
> +

Does anything here guarantee forward progress or at least give
reasonable confidence that we'll make forward progress?

>                 /* Disable IRQs and retry */
>                 local_irq_disable();
> -       }
> +
> +               cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> +
> +       } while (cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> +                                _TIF_UPROBE | _TIF_NEED_RESCHED |
> +                                _TIF_USER_RETURN_NOTIFY));
>
>         user_enter();
>  }
> --
> 2.1.2
>

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 15:17             ` [PATCH v7 05/11] task_isolation: add debug boot flag Chris Metcalf
@ 2015-09-28 20:59               ` Andy Lutomirski
  2015-09-28 21:55                 ` Chris Metcalf
  2015-10-05 17:07               ` Luiz Capitulino
  1 sibling, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 20:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel

On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> The new "task_isolation_debug" flag simplifies debugging
> of TASK_ISOLATION kernels when processes are running in
> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
> interrupts from the kernel, and if they do, when this boot flag is
> specified a kernel stack dump on the console is generated.
>
> It's possible to use ftrace to simply detect whether a task_isolation
> core has unexpectedly entered the kernel.  But what this boot flag
> does is allow the kernel to provide better diagnostics, e.g. by
> reporting in the IPI-generating code what remote core and context
> is preparing to deliver an interrupt to a task_isolation core.
>
> It may be worth considering other ways to generate useful debugging
> output rather than console spew, but for now that is simple and direct.

This may be addressed elsewhere, but is there anything that alerts the
task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a
nohz_full core?

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 20:51                 ` Andy Lutomirski
  (?)
@ 2015-09-28 21:54                 ` Chris Metcalf
  2015-09-28 22:38                   ` Andy Lutomirski
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> With task_isolation mode, the task is in principle guaranteed not to
>> be interrupted by the kernel, but only if it behaves.  In particular,
>> if it enters the kernel via system call, page fault, or any of a
>> number of other synchronous traps, it may be unexpectedly exposed
>> to long latencies.  Add a simple flag that puts the process into
>> a state where any such kernel entry is fatal; this is defined as
>> happening immediately after the SECCOMP test.
> Why after seccomp?  Seccomp is still an entry, and the code would be
> considerably simpler if it were before seccomp.

I could be convinced to do it either way.  My initial thinking was that
a security violation was more interesting and more important to
report than a strict-mode task-isolation violation.  But see my
comments in response to your email on patch 07/11.

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>                  return 0;
>>
>>          prev_ctx = this_cpu_read(context_tracking.state);
>> -       if (prev_ctx != CONTEXT_KERNEL)
>> -               context_tracking_exit(prev_ctx);
>> +       if (prev_ctx != CONTEXT_KERNEL) {
>> +               if (context_tracking_exit(prev_ctx)) {
>> +                       if (task_isolation_strict())
>> +                               task_isolation_exception();
>> +               }
>> +       }
>>
>>          return prev_ctx;
>>   }
> x86 does not promise to call this function.  In fact, x86 is rather
> likely to stop ever calling this function in the reasonably near
> future.

Yes, in which case we'd have to do it the same way we are doing
it for arm64 (see patch 09/11), by calling task_isolation_exception()
explicitly from within the relevant exception handlers.  If we start
doing that, it's probably worth wrapping up the logic into a single
inline function to keep the added code short and sweet.
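
(Something like this, say, mirroring the do_mem_abort() test from
patch 09/11 -- the wrapper name is hypothetical:)

	static inline void task_isolation_check_exception(struct pt_regs *regs)
	{
		if (IS_ENABLED(CONFIG_TASK_ISOLATION) &&
		    test_thread_flag(TIF_NOHZ) &&
		    task_isolation_strict() &&
		    user_mode(regs))
			task_isolation_exception();
	}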

If in fact this might happen in the short term, it might be a good
idea to hook the individual exception handlers in x86 now, and not
hook the exception_enter() mechanism at all.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>    * This call supports re-entrancy. This way it can be called from any exception
>>    * handler without needing to know if we came from userspace or not.
>>    */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
> This needs clear documentation of what the return value means.

Added:

  * Return: if called with state == CONTEXT_USER, the function returns
  * true if we were in fact previously in user mode.

>> +static void kill_task_isolation_strict_task(void)
>> +{
>> +       /* RCU should have been enabled prior to this point. */
>> +       RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
>> +
>> +       dump_stack();
>> +       current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
>> +       send_sig(SIGKILL, current, 1);
>> +}
> Wasn't this supposed to be configurable?  Or is that something that
> happens later on in the series?

Yup, next patch.

>> +void task_isolation_exception(void)
>> +{
>> +       pr_warn("%s/%d: task_isolation strict mode violated by exception\n",
>> +               current->comm, current->pid);
>> +       kill_task_isolation_strict_task();
>> +}
> Should this say what exception?

I could modify it to take a string argument (and then use it for
the arm64 case at least).  For the exception_enter() caller, we actually
don't have the information available to pass down, and it would
be hard to get it.
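
(Roughly this, as a sketch:)

	void task_isolation_exception(const char *what)
	{
		pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
			current->comm, current->pid, what);
		kill_task_isolation_strict_task();
	}

so e.g. do_mem_abort() could pass "page fault", and exception_enter()
could just pass a generic "exception".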

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal
@ 2015-09-28 21:54                   ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 09/28/2015 04:54 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> Allow userspace to override the default SIGKILL delivered
>> when a task_isolation process in STRICT mode does a syscall
>> or otherwise synchronously enters the kernel.
>>
>> In addition to being able to set the signal, we now also
>> pass whether or not the interruption was from a syscall in
>> the si_code field of the siginfo.
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>> ---
>>   include/uapi/linux/prctl.h |  2 ++
>>   kernel/isolation.c         | 17 +++++++++++++----
>>   2 files changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 2b8038b0d1e1..a5582ace987f 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -202,5 +202,7 @@ struct prctl_mm_map {
>>   #define PR_GET_TASK_ISOLATION          49
>>   # define PR_TASK_ISOLATION_ENABLE      (1 << 0)
>>   # define PR_TASK_ISOLATION_STRICT      (1 << 1)
>> +# define PR_TASK_ISOLATION_SET_SIG(sig)        (((sig) & 0x7f) << 8)
>> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
>>
>>   #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/isolation.c b/kernel/isolation.c
>> index 3779ba670472..44bafcd08bca 100644
>> --- a/kernel/isolation.c
>> +++ b/kernel/isolation.c
>> @@ -77,14 +77,23 @@ void task_isolation_enter(void)
>>          }
>>   }
>>
>> -static void kill_task_isolation_strict_task(void)
>> +static void kill_task_isolation_strict_task(int is_syscall)
>>   {
>> +       siginfo_t info = {};
>> +       int sig;
>> +
>>          /* RCU should have been enabled prior to this point. */
>>          RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
>>
>>          dump_stack();
>>          current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
>> -       send_sig(SIGKILL, current, 1);
>> +
>> +       sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
>> +       if (sig == 0)
>> +               sig = SIGKILL;
>> +       info.si_signo = sig;
>> +       info.si_code = is_syscall;
> I think this needs real SI_ defines.

Yeah, it's a fair point, but of course SIGKILL has no SI_ defines
at all right now.  I'm tempted to suggest we just back out setting
si_code altogether.  It might be worth emitting a one-line console
message (a la show_signal_message()) and using that to pack in the
extra information, instead of trying to fuss with the siginfo data.

>> +       send_sig_info(sig, &info, current);
>>   }
>>
>>   /*
>> @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall)
>>
>>          pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n",
>>                  current->comm, current->pid, syscall);
>> -       kill_task_isolation_strict_task();
>> +       kill_task_isolation_strict_task(1);
> No magic numbers please.

I think this is mooted by the above, but good point.
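
For reference, the intended userspace call would then look something
like this (a sketch; it assumes the setter counterpart of
PR_GET_TASK_ISOLATION above, with the flag word as the second
prctl() argument):

    #include <sys/prctl.h>
    #include <signal.h>

    /* Request strict isolation, with SIGUSR1 delivered on violation
     * instead of the default SIGKILL. */
    prctl(PR_SET_TASK_ISOLATION,
          PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
          PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);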

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 20:59               ` Andy Lutomirski
@ 2015-09-28 21:55                 ` Chris Metcalf
  2015-09-28 22:40                   ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel

On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> The new "task_isolation_debug" flag simplifies debugging
>> of TASK_ISOLATION kernels when processes are running in
>> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
>> interrupts from the kernel, and if they do, when this boot flag is
>> specified a kernel stack dump on the console is generated.
>>
>> It's possible to use ftrace to simply detect whether a task_isolation
>> core has unexpectedly entered the kernel.  But what this boot flag
>> does is allow the kernel to provide better diagnostics, e.g. by
>> reporting in the IPI-generating code what remote core and context
>> is preparing to deliver an interrupt to a task_isolation core.
>>
>> It may be worth considering other ways to generate useful debugging
>> output rather than console spew, but for now that is simple and direct.
> This may be addressed elsewhere, but is there anything that alerts the
> task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a
> nohz_full core?

No, and I've thought about it without coming up with a great
solution.  We could certainly fail the initial prctl() if the caller
was not on a nohz_full core.  But this seems a little asymmetric
since the task could be on such a core at prctl() time, and then
do a sched_setaffinity() later to a non-nohz-full core.  Would
we want to fail that call?  Seems heavy-handed.  Or we could
then clear the task-isolation state and emit a console message.

I suppose we could notice that we were on a nohz-full
enabled system and the task isolation flags were set on return
to userspace, but we were not on a nohz-full core, and emit
a console message and clear the task-isolation state at that point.
But that also seems a little questionable; maybe the user for
some reason was doing some odd migratory thing with their
tasks or threads and was going to end up migrating them to
a final destination where the prctl() would apply.

Any suggestions for a better approach?  Is it worth doing the
minimal printk-warning approach in the previous paragraph?
My instinct is to say that we just leave it as-is, I think.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-28 20:59               ` Andy Lutomirski
@ 2015-09-28 21:57                 ` Chris Metcalf
  2015-09-28 22:43                   ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-28 21:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> In prepare_exit_to_usermode(), we would like to call
>> task_isolation_enter() on every return to userspace, and like
>> other work items, we would like to recheck for more work after
>> calling it, since it will enable interrupts internally.
>>
>> However, if task_isolation_enter() is the only work item,
>> and it has already been called once, we don't want to continue
>> calling it in a loop.  We don't have a dedicated TIF flag for
>> task isolation, and it wouldn't make sense to have one, since
>> we'd want to set it before starting exit every time, and then
>> clear it the first time around the loop.
>>
>> Instead, we change the loop structure somewhat, so that we
>> have a more inclusive set of flags that are tested for on the
>> first entry to the function (including TIF_NOHZ), and if any
>> of those flags are set, we enter the loop.  And, we do the
>> task_isolation() test unconditionally at the bottom of the loop,
>> but then when making the decision to loop back, we just use the
>> set of flags that doesn't include TIF_NOHZ.  That way we only
>> loop if there is other work to do, but then if that work
>> is done, we again unconditionally call task_isolation_enter().
>>
>> In syscall_trace_enter_phase1(), we try to add the necessary
>> support for strict-mode detection of syscalls in an optimized
>> way, by letting the code remain unchanged if we are not using
>> TASK_ISOLATION, but otherwise calling enter_from_user_mode()
>> when we first see _TIF_NOHZ, and then waiting until
>> after we do the secure computing work to actually clear the bit
>> from the "work" variable and call task_isolation_syscall().
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>> ---
>>   arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 36 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index 80dcc9261ca3..0f74389c6f3b 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/context_tracking.h>
>>   #include <linux/user-return-notifier.h>
>>   #include <linux/uprobes.h>
>> +#include <linux/isolation.h>
>>
>>   #include <asm/desc.h>
>>   #include <asm/traps.h>
>> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>           */
>>          if (work & _TIF_NOHZ) {
>>                  enter_from_user_mode();
>> -               work &= ~_TIF_NOHZ;
>> +               if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
>> +                       work &= ~_TIF_NOHZ;
>>          }
>>   #endif
>>
>> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>          }
>>   #endif
>>
>> +       /* Now check task isolation, if needed. */
>> +       if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
>> +               work &= ~_TIF_NOHZ;
>> +               if (task_isolation_strict())
>> +                       task_isolation_syscall(regs->orig_ax);
>> +       }
>> +
> This is IMO rather nasty.  Can you try to find a way to do this
> without making the control flow depend on config options?

Well, I suppose this is the best argument for testing for task
isolation before seccomp :-)

Honestly, if not, it's tricky to see how to do better; I did spend
some time looking at it.  One possibility is to just unconditionally
clear _TIF_NOHZ before testing "work == 0", so that we can
test (work & _TIF_NOHZ) once early and once after seccomp.
This presumably costs a cycle in the no-nohz-full case.

So maybe just do it before seccomp...
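
Roughly this shape, i.e. one unconditional early test and clear
(sketch only; task_isolation_strict() compiles to false when
CONFIG_TASK_ISOLATION is off, so the control flow no longer depends
on the config option):

    if (work & _TIF_NOHZ) {
            enter_from_user_mode();
            if (task_isolation_strict())
                    task_isolation_syscall(regs->orig_ax);
            work &= ~_TIF_NOHZ;
    }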

> What guarantees that TIF_NOHZ is an acceptable thing to check?

Well, TIF_NOHZ is set on all tasks whenever we are running with
nohz_full enabled anywhere, so testing it lets us do stuff on
the fastpath without slowing down the fastpath much.
See context_tracking_cpu_set().

>>          /* Do our best to finish without phase 2. */
>>          if (work == 0)
>>                  return ret;  /* seccomp and/or nohz only (ret == 0 here) */
>> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
>>   /* Called with IRQs disabled. */
>>   __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>   {
>> +       u32 cached_flags;
>> +
>>          if (WARN_ON(!irqs_disabled()))
>>                  local_irq_disable();
>>
>>          /*
>> +        * We may want to enter the loop here unconditionally to make
>> +        * sure to do some work at least once.  Test here for all
>> +        * possible conditions that might make us enter the loop,
>> +        * and return immediately if none of them are set.
>> +        */
>> +       cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>> +       if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>> +                             _TIF_UPROBE | _TIF_NEED_RESCHED |
>> +                             _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
>> +               user_enter();
>> +               return;
>> +       }
>> +
> Too complicated and too error prone.
>
> In any event, I don't think that the property you actually want is for
> the loop to be entered once.  I think the property you want is that
> we're isolated by the time we're finished.  Why not just check that
> directly in the loop condition?

So something like this (roughly):

                 if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                                       _TIF_UPROBE | _TIF_NEED_RESCHED |
                                       _TIF_USER_RETURN_NOTIFY)) &&
+                    task_isolation_done())
                         break;

i.e. just add the one extra call?  That could work, I suppose.
In the body we would then keep the proposed logic that unconditionally
calls task_isolation_enter().
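
Putting the pieces together, the exit path might end up shaped
roughly like this (condensed sketch, not the actual patch):

    do {
            /* We have work to do. */
            local_irq_enable();

            /* ... existing handlers: resched, uprobes, signals,
             * notify-resume, user-return notifiers ... */

            if (task_isolation_enabled())
                    task_isolation_enter();

            local_irq_disable();
            cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
    } while ((cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                              _TIF_UPROBE | _TIF_NEED_RESCHED |
                              _TIF_USER_RETURN_NOTIFY)) ||
             !task_isolation_done());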

>> +       /*
>>           * In order to return to user mode, we need to have IRQs off with
>>           * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
>>           * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
>> @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>           * so we need to loop.  Disabling preemption wouldn't help: doing the
>>           * work to clear some of the flags can sleep.
>>           */
>> -       while (true) {
>> -               u32 cached_flags =
>> -                       READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>> -
>> -               if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>> -                                     _TIF_UPROBE | _TIF_NEED_RESCHED |
>> -                                     _TIF_USER_RETURN_NOTIFY)))
>> -                       break;
>> -
>> +       do {
>>                  /* We have work to do. */
>>                  local_irq_enable();
>>
>> @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>                  if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>>                          fire_user_return_notifiers();
>>
>> +               if (task_isolation_enabled())
>> +                       task_isolation_enter();
>> +
> Does anything here guarantee forward progress or at least give
> reasonable confidence that we'll make forward progress?

A given task can get stuck in the kernel if it has a lengthy far-future
alarm() type situation, or if there are multiple task-isolated tasks
scheduled onto the same core, but that only affects those tasks;
other tasks on the same core, and the system as a whole, are OK.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 21:54                 ` Chris Metcalf
@ 2015-09-28 22:38                   ` Andy Lutomirski
  2015-09-29 17:35                     ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 22:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>>                  return 0;
>>>
>>>          prev_ctx = this_cpu_read(context_tracking.state);
>>> -       if (prev_ctx != CONTEXT_KERNEL)
>>> -               context_tracking_exit(prev_ctx);
>>> +       if (prev_ctx != CONTEXT_KERNEL) {
>>> +               if (context_tracking_exit(prev_ctx)) {
>>> +                       if (task_isolation_strict())
>>> +                               task_isolation_exception();
>>> +               }
>>> +       }
>>>
>>>          return prev_ctx;
>>>   }
>>
>> x86 does not promise to call this function.  In fact, x86 is rather
>> likely to stop ever calling this function in the reasonably near
>> future.
>
>
> Yes, in which case we'd have to do it the same way we are doing
> it for arm64 (see patch 09/11), by calling task_isolation_exception()
> explicitly from within the relevant exception handlers.  If we start
> doing that, it's probably worth wrapping up the logic into a single
> inline function to keep the added code short and sweet.
>
> If in fact this might happen in the short term, it might be a good
> idea to hook the individual exception handlers in x86 now, and not
> hook the exception_enter() mechanism at all.

It's already like that in Linus' tree.

FWIW, most of those exception handlers send signals, so it might pay
to do it in notify_die or die instead.

>
>>> --- a/kernel/context_tracking.c
>>> +++ b/kernel/context_tracking.c
>>> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>>    * This call supports re-entrancy. This way it can be called from any
>>> exception
>>>    * handler without needing to know if we came from userspace or not.
>>>    */
>>> -void context_tracking_exit(enum ctx_state state)
>>> +bool context_tracking_exit(enum ctx_state state)
>>
>> This needs clear documentation of what the return value means.
>
>
> Added:
>
>  * Return: if called with state == CONTEXT_USER, the function returns
>  * true if we were in fact previously in user mode.

This should note that it only returns true if context tracking is on.

>>> +void task_isolation_exception(void)
>>> +{
>>> +       pr_warn("%s/%d: task_isolation strict mode violated by
>>> exception\n",
>>> +               current->comm, current->pid);
>>> +       kill_task_isolation_strict_task();
>>> +}
>>
>> Should this say what exception?
>
>
> I could modify it to take a string argument (and then use it for
> the arm64 case at least).  For the exception_enter() caller, we actually
> don't have the information available to pass down, and it would
> be hard to get it.

For x86, the relevant info might be the actual hw error number
(error_code, which makes it into die) or the signal.  If we send a
death signal, then reporting the error number the usual way might make
sense.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 21:55                 ` Chris Metcalf
@ 2015-09-28 22:40                   ` Andy Lutomirski
  2015-09-29 17:35                     ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 22:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel

On Mon, Sep 28, 2015 at 2:55 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> The new "task_isolation_debug" flag simplifies debugging
>>> of TASK_ISOLATION kernels when processes are running in
>>> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
>>> interrupts from the kernel, and if they do, when this boot flag is
>>> specified a kernel stack dump on the console is generated.
>>>
>>> It's possible to use ftrace to simply detect whether a task_isolation
>>> core has unexpectedly entered the kernel.  But what this boot flag
>>> does is allow the kernel to provide better diagnostics, e.g. by
>>> reporting in the IPI-generating code what remote core and context
>>> is preparing to deliver an interrupt to a task_isolation core.
>>>
>>> It may be worth considering other ways to generate useful debugging
>>> output rather than console spew, but for now that is simple and direct.
>>
>> This may be addressed elsewhere, but is there anything that alerts the
>> task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a
>> nohz_full core?
>
>
> No, and I've thought about it without coming up with a great
> solution.  We could certainly fail the initial prctl() if the caller
> was not on a nohz_full core.  But this seems a little asymmetric
> since the task could be on such a core at prctl() time, and then
> do a sched_setaffinity() later to a non-nohz-full core.  Would
> we want to fail that call?  Seems heavy-handed.  Or we could
> then clear the task-isolation state and emit a console message.

If I were writing a program that used this feature, I think I'd want
to know early that it's not going to work so I can tell the admin very
loudly to fix it.  Maybe just failing the prctl would be enough.  If
someone turns on the prctl and then changes their affinity, maybe we
should treat it as though they're just asking for trouble and allow
it.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-28 21:57                 ` Chris Metcalf
@ 2015-09-28 22:43                   ` Andy Lutomirski
  2015-09-29 17:42                     ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-28 22:43 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Mon, Sep 28, 2015 at 2:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> In prepare_exit_to_usermode(), we would like to call
>>> task_isolation_enter() on every return to userspace, and like
>>> other work items, we would like to recheck for more work after
>>> calling it, since it will enable interrupts internally.
>>>
>>> However, if task_isolation_enter() is the only work item,
>>> and it has already been called once, we don't want to continue
>>> calling it in a loop.  We don't have a dedicated TIF flag for
>>> task isolation, and it wouldn't make sense to have one, since
>>> we'd want to set it before starting exit every time, and then
>>> clear it the first time around the loop.
>>>
>>> Instead, we change the loop structure somewhat, so that we
>>> have a more inclusive set of flags that are tested for on the
>>> first entry to the function (including TIF_NOHZ), and if any
>>> of those flags are set, we enter the loop.  And, we do the
>>> task_isolation() test unconditionally at the bottom of the loop,
>>> but then when making the decision to loop back, we just use the
>>> set of flags that doesn't include TIF_NOHZ.  That way we only
>>> loop if there is other work to do, but then if that work
>>> is done, we again unconditionally call task_isolation_enter().
>>>
>>> In syscall_trace_enter_phase1(), we try to add the necessary
>>> support for strict-mode detection of syscalls in an optimized
>>> way, by letting the code remain unchanged if we are not using
>>> TASK_ISOLATION, but otherwise calling enter_from_user_mode()
>>> when we first see _TIF_NOHZ, and then waiting until
>>> after we do the secure computing work to actually clear the bit
>>> from the "work" variable and call task_isolation_syscall().
>>>
>>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
>>> ---
>>>   arch/x86/entry/common.c | 47
>>> ++++++++++++++++++++++++++++++++++++-----------
>>>   1 file changed, 36 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>>> index 80dcc9261ca3..0f74389c6f3b 100644
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -21,6 +21,7 @@
>>>   #include <linux/context_tracking.h>
>>>   #include <linux/user-return-notifier.h>
>>>   #include <linux/uprobes.h>
>>> +#include <linux/isolation.h>
>>>
>>>   #include <asm/desc.h>
>>>   #include <asm/traps.h>
>>> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs
>>> *regs, u32 arch)
>>>           */
>>>          if (work & _TIF_NOHZ) {
>>>                  enter_from_user_mode();
>>> -               work &= ~_TIF_NOHZ;
>>> +               if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
>>> +                       work &= ~_TIF_NOHZ;
>>>          }
>>>   #endif
>>>
>>> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct
>>> pt_regs *regs, u32 arch)
>>>          }
>>>   #endif
>>>
>>> +       /* Now check task isolation, if needed. */
>>> +       if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
>>> +               work &= ~_TIF_NOHZ;
>>> +               if (task_isolation_strict())
>>> +                       task_isolation_syscall(regs->orig_ax);
>>> +       }
>>> +
>>
>> This is IMO rather nasty.  Can you try to find a way to do this
>> without making the control flow depend on config options?
>
>
> Well, I suppose this is the best argument for testing for task
> isolation before seccomp :-)
>
> Honestly, if not, it's tricky to see how to do better; I did spend
> some time looking at it.  One possibility is to just unconditionally
> clear _TIF_NOHZ before testing "work == 0", so that we can
> test (work & _TIF_NOHZ) once early and once after seccomp.
> This presumably costs a cycle in the no-nohz-full case.
>
> So maybe just do it before seccomp...
>
>> What guarantees that TIF_NOHZ is an acceptable thing to check?
>
>
> Well, TIF_NOHZ is set on all tasks whenever we are running with
> nohz_full enabled anywhere, so testing it lets us do stuff on
> the fastpath without slowing down the fastpath much.
> See context_tracking_cpu_set().
>
>
>>>          /* Do our best to finish without phase 2. */
>>>          if (work == 0)
>>>                  return ret;  /* seccomp and/or nohz only (ret == 0 here)
>>> */
>>> @@ -217,10 +226,26 @@ static struct thread_info
>>> *pt_regs_to_thread_info(struct pt_regs *regs)
>>>   /* Called with IRQs disabled. */
>>>   __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>>   {
>>> +       u32 cached_flags;
>>> +
>>>          if (WARN_ON(!irqs_disabled()))
>>>                  local_irq_disable();
>>>
>>>          /*
>>> +        * We may want to enter the loop here unconditionally to make
>>> +        * sure to do some work at least once.  Test here for all
>>> +        * possible conditions that might make us enter the loop,
>>> +        * and return immediately if none of them are set.
>>> +        */
>>> +       cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>>> +       if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>>> +                             _TIF_UPROBE | _TIF_NEED_RESCHED |
>>> +                             _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
>>> +               user_enter();
>>> +               return;
>>> +       }
>>> +
>>
>> Too complicated and too error prone.
>>
>> In any event, I don't think that the property you actually want is for
>> the loop to be entered once.  I think the property you want is that
>> we're isolated by the time we're finished.  Why not just check that
>> directly in the loop condition?
>
>
> So something like this (roughly):
>
>                 if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>                                       _TIF_UPROBE | _TIF_NEED_RESCHED |
>                                       _TIF_USER_RETURN_NOTIFY)) &&
> +                    task_isolation_done())
>                         break;
>
> i.e. just add the one extra call?  That could work, I suppose.
> In the body we would then keep the proposed logic that unconditionally
> calls task_isolation_enter().

Yeah, I think so.

>> Does anything here guarantee forward progress or at least give
>> reasonable confidence that we'll make forward progress?
>
>
> A given task can get stuck in the kernel if it has a lengthy far-future
> alarm() type situation, or if there are multiple task-isolated tasks
> scheduled onto the same core, but that only affects those tasks;
> other tasks on the same core, and the system as a whole, are OK.

Why are we treating alarms as something that should defer entry to
userspace?  I think it would be entirely reasonable to set an alarm
for ten minutes, ask for isolation, and then think hard for ten
minutes.

A bigger issue would be if there's an RT task that asks for isolation
and a bunch of other stuff (most notably KVM hosts) running with
unconstrained affinity at full load.  If task_isolation_enter always
sleeps, then your KVM host will get scheduled, and it'll ask for a
user return notifier on the way out, and you might just loop forever.
Can this happen?

ISTM something's suboptimal with the inner workings of all this if
task_isolation_enter needs to sleep to wait for an event that isn't
scheduled for the immediate future (e.g. already queued up as an
interrupt).

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 22:40                   ` Andy Lutomirski
@ 2015-09-29 17:35                     ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-29 17:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc,
	linux-kernel

On 09/28/2015 06:40 PM, Andy Lutomirski wrote:
> If I were writing a program that used this feature, I think I'd want
> to know early that it's not going to work so I can tell the admin very
> loudly to fix it.  Maybe just failing the prctl would be enough.  If
> someone turns on the prctl and then changes their affinity, maybe we
> should treat it as though they're just asking for trouble and allow
> it.

Yes, the original Tilera implementation required the task to be
affinitized to a single, nohz_full cpu before you could call the
prctl() successfully.  I think I will reinstate that for the patch series,
and if the user then re-affinitizes to a non-nohz_full core later,
well, "don't do that then".

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-28 22:38                   ` Andy Lutomirski
@ 2015-09-29 17:35                     ` Chris Metcalf
  2015-09-29 17:46                         ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-29 17:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 09/28/2015 06:38 PM, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>>>                   return 0;
>>>>
>>>>           prev_ctx = this_cpu_read(context_tracking.state);
>>>> -       if (prev_ctx != CONTEXT_KERNEL)
>>>> -               context_tracking_exit(prev_ctx);
>>>> +       if (prev_ctx != CONTEXT_KERNEL) {
>>>> +               if (context_tracking_exit(prev_ctx)) {
>>>> +                       if (task_isolation_strict())
>>>> +                               task_isolation_exception();
>>>> +               }
>>>> +       }
>>>>
>>>>           return prev_ctx;
>>>>    }
>>> x86 does not promise to call this function.  In fact, x86 is rather
>>> likely to stop ever calling this function in the reasonably near
>>> future.
>>
>> Yes, in which case we'd have to do it the same way we are doing
>> it for arm64 (see patch 09/11), by calling task_isolation_exception()
>> explicitly from within the relevant exception handlers.  If we start
>> doing that, it's probably worth wrapping up the logic into a single
>> inline function to keep the added code short and sweet.
>>
>> If in fact this might happen in the short term, it might be a good
>> idea to hook the individual exception handlers in x86 now, and not
>> hook the exception_enter() mechanism at all.
> It's already like that in Linus' tree.

OK, I will restructure so that it doesn't rely on the context_tracking
code at all, but instead requires a line of code in every relevant
kernel exception handler.

> FWIW, most of those exception handlers send signals, so it might pay
> to do it in notify_die or die instead.

Well, the most interesting category is things that don't actually
trigger a signal (e.g. minor page fault) since those are things that
cause significant issues with task isolation processes
(kernel-induced jitter) but aren't otherwise user-visible,
much like an undiscovered syscall in a third-party library
can cause unexpected jitter.

> For x86, the relevant info might be the actual hw error number
> (error_code, which makes it into die) or the signal.  If we send a
> death signal, then reporting the error number the usual way might make
> sense.

I may just choose to use a task_isolation_exception(fmt, ...)
signature so that code can printk a suitable one-liner before
delivering the SIGKILL (or whatever signal was configured).
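
i.e. something like this (a sketch; the buffer size and the argument
to the kill helper are placeholders):

    void task_isolation_exception(const char *fmt, ...)
    {
            char buf[100];
            va_list args;

            va_start(args, fmt);
            vsnprintf(buf, sizeof(buf), fmt, args);
            va_end(args);

            pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
                    current->comm, current->pid, buf);
            kill_task_isolation_strict_task(0);
    }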

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-28 22:43                   ` Andy Lutomirski
@ 2015-09-29 17:42                     ` Chris Metcalf
  2015-09-29 17:57                       ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-29 17:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On 09/28/2015 06:43 PM, Andy Lutomirski wrote:
> Why are we treating alarms as something that should defer entry to
> userspace?  I think it would be entirely reasonable to set an alarm
> for ten minutes, ask for isolation, and then think hard for ten
> minutes.
>
> A bigger issue would be if there's an RT task that asks for isolation
> and a bunch of other stuff (most notably KVM hosts) running with
> unconstrained affinity at full load.  If task_isolation_enter always
> sleeps, then your KVM host will get scheduled, and it'll ask for a
> user return notifier on the way out, and you might just loop forever.
> Can this happen?

task_isolation_enter() doesn't sleep - it spins.  This is intentional,
because the point is that there should be nothing else that
could be scheduled on that cpu.  We're just waiting for any
pending kernel management timer interrupts to fire.
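
Concretely, the spin looks something like this (sketch; the "tick
fully stopped" predicate here is a stand-in for whatever test the
final version ends up using):

    /* Spin with interrupts briefly enabled so that any pending tick
     * or timer interrupt can actually fire and be retired. */
    while (!tick_nohz_tick_stopped()) {
            local_irq_enable();
            cpu_relax();
            local_irq_disable();
    }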

In any case, you normally wouldn't have a KVM host running
on an isolcpus, nohz_full cpu, unless it was the only thing
running there, I imagine (just as would be true for any other
host process).

> ISTM something's suboptimal with the inner workings of all this if
> task_isolation_enter needs to sleep to wait for an event that isn't
> scheduled for the immediate future (e.g. already queued up as an
> interrupt).

Scheduling a timer for 10 minutes away is typically done by
scheduling timers for the max timer granularity (which could
be just a few seconds) and then waking up a couple of hundred
times between now and now+10 minutes.  Doing this breaks
the task isolation guarantee, so we can't return to userspace
while something like that is pending.  You'd have to do it
by polling in userspace to avoid the unexpected interrupts.
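
The userspace polling approach is straightforward; clock_gettime()
goes through the vDSO on most architectures, so it never re-enters
the kernel (sketch, with do_work() as a stand-in for the
application's own processing):

    #include <time.h>

    void wait_ten_minutes(void)
    {
            struct timespec ts;
            time_t deadline;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            deadline = ts.tv_sec + 600;     /* "alarm" ten minutes out */
            do {
                    do_work();
                    clock_gettime(CLOCK_MONOTONIC, &ts);
            } while (ts.tv_sec < deadline);
    }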

I suppose if your hardware supported it, you could imagine
a mode where userspace can request an alarm a specific
amount of time in the future, and the task isolation code
would then ignore an alarm that was going off at that
specific time.  But I'm not sure what hardware does support
that (I know tile uses the "few seconds and re-arm" model),
and it seems like a pretty corner use-case.  We could
certainly investigate adding such support later, but I don't see
it as part of the core value proposition for task isolation.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-09-29 17:46                         ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-29 17:46 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 06:38 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> On 09/28/2015 04:51 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>>>>
>>>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>>>>                   return 0;
>>>>>
>>>>>           prev_ctx = this_cpu_read(context_tracking.state);
>>>>> -       if (prev_ctx != CONTEXT_KERNEL)
>>>>> -               context_tracking_exit(prev_ctx);
>>>>> +       if (prev_ctx != CONTEXT_KERNEL) {
>>>>> +               if (context_tracking_exit(prev_ctx)) {
>>>>> +                       if (task_isolation_strict())
>>>>> +                               task_isolation_exception();
>>>>> +               }
>>>>> +       }
>>>>>
>>>>>           return prev_ctx;
>>>>>    }
>>>>
>>>> x86 does not promise to call this function.  In fact, x86 is rather
>>>> likely to stop ever calling this function in the reasonably near
>>>> future.
>>>
>>>
>>> Yes, in which case we'd have to do it the same way we are doing
>>> it for arm64 (see patch 09/11), by calling task_isolation_exception()
>>> explicitly from within the relevant exception handlers.  If we start
>>> doing that, it's probably worth wrapping up the logic into a single
>>> inline function to keep the added code short and sweet.
>>>
>>> If in fact this might happen in the short term, it might be a good
>>> idea to hook the individual exception handlers in x86 now, and not
>>> hook the exception_enter() mechanism at all.
>>
>> It's already like that in Linus' tree.
>
>
> OK, I will restructure so that it doesn't rely on the context_tracking
> code at all, but instead requires a line of code in every relevant
> kernel exception handler.
>
>> FWIW, most of those exception handlers send signals, so it might pay
>> to do it in notify_die or die instead.
>
>
> Well, the most interesting category is things that don't actually
> trigger a signal (e.g. minor page fault) since those are things that
> cause significant issues with task isolation processes
> (kernel-induced jitter) but aren't otherwise user-visible,
> much like an undiscovered syscall in a third-party library
> can cause unexpected jitter.

Would it make sense to exempt the exceptions that result in signals?
After all, those are detectable even without your patches.  Going
through all of the exception types:

divide_error, overflow, invalid_op, coprocessor_segment_overrun,
invalid_TSS, segment_not_present, stack_segment, alignment_check:
these all send signals anyway.

double_fault is fatal.

bounds: MPX faults can be silently fixed up, and those will need
notification.  (Or user code should know not to do that, since it
requires an explicit opt in, and user code can flip it back off to get
the signals.)

general_protection: always signals except in vm86 mode.

int3: silently fixed if uprobes are in use, but I don't think
isolation cares about that.  Otherwise signals.

debug: The perf hw_breakpoint can result in silent fixups, but those
require explicit opt-in from the admin.  Otherwise, unless there's a
bug or a debugger, the user will get a signal.  (As a practical
matter, the only interesting case is the undocumented ICEBP
instruction.)

math_error, simd_coprocessor_error: Sends a signal.

spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK.  We should
just WARN if this hits.

device_not_available: If you're using isolation without an FPU, you
have bigger problems.

page_fault: Needs notification.

NMI, MCE: arguably these should *not* notify or at least not fatally.

So maybe a better approach would be to explicitly notify for the
relevant entries: IRQs, non-signalling page faults, and non-signalling
MPX fixups.  Other arches would have their own lists, but they're
probably also short except for emulated instructions.
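
For the page fault case, the notification could presumably be a
one-liner in the fault path, along these lines (placement and the
fmt-style helper are assumptions):

    /* After handling a minor fault from userspace that produced no
     * signal: */
    if (user_mode(regs) && task_isolation_strict())
            task_isolation_exception("page fault at %#lx", address);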

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-09-29 17:57                           ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-09-29 17:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> Well, the most interesting category is things that don't actually
>> trigger a signal (e.g. minor page fault) since those are things that
>> cause significant issues with task isolation processes
>> (kernel-induced jitter) but aren't otherwise user-visible,
>> much like an undiscovered syscall in a third-party library
>> can cause unexpected jitter.
> Would it make sense to exempt the exceptions that result in signals?
> After all, those are detectable even without your patches.  Going
> through all of the exception types:
>
> divide_error, overflow, invalid_op, coprocessor_segment_overrun,
> invalid_TSS, segment_not_present, stack_segment, alignment_check:
> these all send signals anyway.
>
> double_fault is fatal.
>
> bounds: MPX faults can be silently fixed up, and those will need
> notification.  (Or user code should know not to do that, since it
> requires an explicit opt in, and user code can flip it back off to get
> the signals.)
>
> general_protection: always signals except in vm86 mode.
>
> int3: silently fixed if uprobes are in use, but I don't think
> isolation cares about that.  Otherwise signals.
>
> debug: The perf hw_breakpoint can result in silent fixups, but those
> require explicit opt-in from the admin.  Otherwise, unless there's a
> bug or a debugger, the user will get a signal.  (As a practical
> matter, the only interesting case is the undocumented ICEBP
> instruction.)
>
> math_error, simd_coprocessor_error: Sends a signal.
>
> spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK.  We should
> just WARN if this hits.
>
> device_not_available: If you're using isolation without an FPU, you
> have bigger problems.
>
> page_fault: Needs notification.
>
> NMI, MCE: arguably these should *not* notify or at least not fatally.
>
> So maybe a better approach would be to explicitly notify for the
> relevant entries: IRQs, non-signalling page faults, and non-signalling
> MPX fixups.  Other arches would have their own lists, but they're
> probably also short except for emulated instructions.

IRQs should get notified via the task_isolation_debug boot flag;
the intent is that they should never get delivered to nohz_full
cores anyway, so we produce a console backtrace if the boot
flag is enabled.  This isn't tied to having a task running with
TASK_ISOLATION enabled, since it just shouldn't ever happen.
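
The hook itself is along these lines (sketch; the flag and helper
names are assumed):

    /* Called from IPI-generating paths (smp_call_function, irq_work,
     * etc.) when targeting a remote cpu. */
    void task_isolation_debug(int cpu)
    {
            if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu)) {
                    pr_err("Interrupt being sent to nohz_full cpu %d:\n",
                           cpu);
                    dump_stack();
            }
    }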

Thanks for reviewing the possible exception sources on x86,
which I'm less familiar with than tile.  Non-signalling page faults
and MPX fixups sounds exactly right - and I didn't know about
MPX before your email (other than the userspace side of
the notion of bounds registers), so thanks for the pointer.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-29 17:42                     ` Chris Metcalf
@ 2015-09-29 17:57                       ` Andy Lutomirski
  2015-09-30 20:25                         ` Thomas Gleixner
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-29 17:57 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/28/2015 06:43 PM, Andy Lutomirski wrote:
>>
>> Why are we treating alarms as something that should defer entry to
>> userspace?  I think it would be entirely reasonable to set an alarm
>> for ten minutes, ask for isolation, and then think hard for ten
>> minutes.
>>
>> A bigger issue would be if there's an RT task that asks for isolation
>> and a bunch of other stuff (most notably KVM hosts) running with
>> unconstrained affinity at full load.  If task_isolation_enter always
>> sleeps, then your KVM host will get scheduled, and it'll ask for a
>> user return notifier on the way out, and you might just loop forever.
>> Can this happen?
>
>
> task_isolation_enter() doesn't sleep - it spins.  This is intentional,
> because the point is that there should be nothing else that
> could be scheduled on that cpu.  We're just waiting for any
> pending kernel management timer interrupts to fire.
>
> In any case, you normally wouldn't have a KVM host running
> on an isolcpus, nohz_full cpu, unless it was the only thing
> running there, I imagine (just as would be true for any other
> host process).

The problem is that AFAICT systemd (and possibly other things) makes
it rather painful to guarantee that nothing low-priority (systemd
itself) would schedule on an arbitrary CPU.  I would hope that merely
setting affinity and RT would be enough to get isolation without
playing fancy cgroup games.  Maybe not.

>
>> ISTM something's suboptimal with the inner workings of all this if
>> task_isolation_enter needs to sleep to wait for an event that isn't
>> scheduled for the immediate future (e.g. already queued up as an
>> interrupt).
>
>
> Scheduling a timer for 10 minutes away is typically done by
> scheduling timers for the max timer granularity (which could
> be just a few seconds) and then waking up a couple of hundred
> times between now and now+10 minutes.  Doing this breaks
> the task isolation guarantee, so we can't return to userspace
> while something like that is pending.  You'd have to do it
> by polling in userspace to avoid the unexpected interrupts.
>

Really?  That sucks.  Hopefully we can fix it.

> I suppose if your hardware supported it, you could imagine
> a mode where userspace can request an alarm a specific
> amount of time in the future, and the task isolation code
> would then ignore an alarm that was going off at that
> specific time.  But I'm not sure what hardware does support
> that (I know tile uses the "few seconds and re-arm" model),
> and it seems like a pretty corner use-case.  We could
> certainly investigate adding such support later, but I don't see
> it as part of the core value proposition for task isolation.
>

Intel chips Sandy Bridge and newer certainly support this. Other chips
might support it as well.  Whether the kernel is able to program the
TSC deadline timer like that is a different question.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-09-29 17:57                           ` Chris Metcalf
  (?)
@ 2015-09-29 18:00                           ` Andy Lutomirski
  2015-10-01 19:25                               ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-29 18:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
>>
>> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> Well, the most interesting category is things that don't actually
>>> trigger a signal (e.g. minor page fault) since those are things that
>>> cause significant issues with task isolation processes
>>> (kernel-induced jitter) but aren't otherwise user-visible,
>>> much like an undiscovered syscall in a third-party library
>>> can cause unexpected jitter.
>>
>> Would it make sense to exempt the exceptions that result in signals?
>> After all, those are detectable even without your patches.  Going
>> through all of the exception types:
>>
>> divide_error, overflow, invalid_op, coprocessor_segment_overrun,
>> invalid_TSS, segment_not_present, stack_segment, alignment_check:
>> these all send signals anyway.
>>
>> double_fault is fatal.
>>
>> bounds: MPX faults can be silently fixed up, and those will need
>> notification.  (Or user code should know not to do that, since it
>> requires an explicit opt in, and user code can flip it back off to get
>> the signals.)
>>
>> general_protection: always signals except in vm86 mode.
>>
>> int3: silently fixed if uprobes are in use, but I don't think
>> isolation cares about that.  Otherwise signals.
>>
>> debug: The perf hw_breakpoint can result in silent fixups, but those
>> require explicit opt-in from the admin.  Otherwise, unless there's a
>> bug or a debugger, the user will get a signal.  (As a practical
>> matter, the only interesting case is the undocumented ICEBP
>> instruction.)
>>
>> math_error, simd_coprocessor_error: Sends a signal.
>>
>> spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK.  We should
>> just WARN if this hits.
>>
>> device_not_available: If you're using isolation without an FPU, you
>> have bigger problems.
>>
>> page_fault: Needs notification.
>>
>> NMI, MCE: arguably these should *not* notify or at least not fatally.
>>
>> So maybe a better approach would be to explicitly notify for the
>> relevant entries: IRQs, non-signalling page faults, and non-signalling
>> MPX fixups.  Other arches would have their own lists, but they're
>> probably also short except for emulated instructions.
>
>
> IRQs should get notified via the task_isolation_debug boot flag;
> the intent is that they should never get delivered to nohz_full
> cores anyway, so we produce a console backtrace if the boot
> flag is enabled.  This isn't tied to having a task running with
> TASK_ISOLATION enabled, since it just shouldn't ever happen.

OK, I like that.  In that case, maybe NMI and MCE should be in a
similar category.  (IOW if a non-fatal MCE happens and the debug param
is set, we could warn, assuming that anyone is willing to write the
code.  Doing printk from MCE is not entirely trivial, although it's
less bad in recent kernels.)

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-29 17:57                       ` Andy Lutomirski
@ 2015-09-30 20:25                         ` Thomas Gleixner
  2015-09-30 20:58                           ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Thomas Gleixner @ 2015-09-30 20:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Tue, 29 Sep 2015, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > Scheduling a timer for 10 minutes away is typically done by
> > scheduling timers for the max timer granularity (which could
> > be just a few seconds) and then waking up a couple of hundred
> > times between now and now+10 minutes.  Doing this breaks
> > the task isolation guarantee, so we can't return to userspace
> > while something like that is pending.  You'd have to do it
> > by polling in userspace to avoid the unexpected interrupts.
> >
> 
> Really?  That sucks.  Hopefully we can fix it.

Well. If there is a timer pending, then what do you want to fix? If
the hardware does not allow programming a far-out expiry time, then
the kernel cannot do anything about it.

That depends on the timer hardware, really. PIT can do ~50ms, HPET
~3min, APIC timer ~32sec, TSC deadline timer ~500years.
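
(Rough arithmetic for scale: a 32-bit APIC timer counting a ~134MHz
bus clock wraps after 2^32 / 134e6 ≈ 32 seconds, while a 64-bit TSC
deadline comparator at ~1.2GHz covers 2^64 / 1.2e9 seconds ≈ 487
years; ballpark figures only.)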

> > I suppose if your hardware supported it, you could imagine
> > a mode where userspace can request an alarm a specific
> > amount of time in the future, and the task isolation code
> > would then ignore an alarm that was going off at that
> > specific time.

Ignore in what way?

> > But I'm not sure what hardware does support
> > that (I know tile uses the "few seconds and re-arm" model),
> > and it seems like a pretty corner use-case.  We could
> > certainly investigate adding such support later, but I don't see
> > it as part of the core value proposition for task isolation.

I'm really not following here. If user space requested that timer, WHY
on earth do you want to spin in the kernel until it fired? That's
insane.
 
> Intel chips Sandy Bridge and newer certainly support this. Other chips
> might support it as well.  Whether the kernel is able to program the
> TSC deadline timer like that is a different question.

If the next expiry is out 100 years then we will happily program it
for that expiry time.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-30 20:25                         ` Thomas Gleixner
@ 2015-09-30 20:58                           ` Chris Metcalf
  2015-09-30 22:02                             ` Thomas Gleixner
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-09-30 20:58 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin,
	X86 ML

On 09/30/2015 04:25 PM, Thomas Gleixner wrote:
> On Tue, 29 Sep 2015, Andy Lutomirski wrote:
>> On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>> I suppose if your hardware supported it, you could imagine
>>> a mode where userspace can request an alarm a specific
>>> amount of time in the future, and the task isolation code
>>> would then ignore an alarm that was going off at that
>>> specific time.
> Ignore in what way?

So the model for task isolation, as proposed, is that you are
NOT trying to use kernel services, because the overheads are
too great.  The typical example is that you are running a
userspace packet-processing application, and entering the
kernel for pretty much any reason will cause your thread to
drop packets all over the floor.  So, we ensure that you never
enter the kernel asynchronously.

Now obviously if the task needs to enter the kernel, you have
to allow it to do so, and in fact there will be plenty of scenarios
where that's exactly what you want to do; but for example you
may have a way to register that a particular thread is opting out
of packet processing for the moment, and it will instead go
off and use the kernel to log some kind of information about
some exceptional packet flow, or some debugging state about
its own internals, or whatever.  At that moment you would like
the thread to be able to use arbitrary kernel facilities in a
relatively transparent way.

However, when it is done using the kernel and returns to
userspace and signs up to handle packets again, you REALLY
don't want to encounter some kind of tailing off of timer
interrupts while some kernel subsystem quiesces or whatever.
So we would like to spin in the kernel until no further timer
interrupts are queued.  In the original tile implementation,
we would just wait until the timer interrupt was masked,
which guaranteed it wouldn't fire again; for a more portable
approach in the task-isolation patch series, I'm testing
that the tick_cpu_device has next_event set to KTIME_MAX.
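
(Concretely, the check amounts to something like the following,
mirroring the loop in the posted series:)

	struct clock_event_device *dev =
		__this_cpu_read(tick_cpu_device.evtdev);

	/* Spin until no further timer tick is pending on this cpu. */
	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX)
		cpu_relax();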

The discussion in this email thread is that maybe it might
make sense for one of these userspace driver threads to
request a SIGALARM in 10 minutes, and then you'd assume
it was done very intentionally, and figure out a way to
discount that timer event somehow -- so you'd still return
to userspace if all that was pending was one timer interrupt
scheduled 10 minutes out, but if your hardware didn't
allow setting such a long timer, you wouldn't return early,
or if some other event was scheduled sooner, you wouldn't
return early, etc.  As I said in my previous email, this seems
like a total corner case and not worth worrying about now;
maybe in the future someone will want to think about it.

So for now, if a task-isolation thread sets up a timer,
they're screwed: so, don't do that.  And it's really not part of
the typical programming model for these kinds of userspace
drivers anyway, so it's pretty reasonable to forbid it.
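
(For reference, the intended userspace usage is a one-time opt-in
before entering the main loop; a minimal sketch using the constants
from this series, where poll_device_rings() is a stand-in for the
application's real work:)

	#include <sys/prctl.h>

	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		  0, 0, 0) < 0)
		perror("prctl");

	for (;;)
		poll_device_rings();	/* never enters the kernel */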

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-30 20:58                           ` Chris Metcalf
@ 2015-09-30 22:02                             ` Thomas Gleixner
  2015-09-30 22:11                               ` Andy Lutomirski
  0 siblings, 1 reply; 340+ messages in thread
From: Thomas Gleixner @ 2015-09-30 22:02 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Wed, 30 Sep 2015, Chris Metcalf wrote:
> So for now, if a task-isolation thread sets up a timer,
> they're screwed: so, don't do that.  And it's really not part of
> the typical programming model for these kinds of userspace
> drivers anyway, so it's pretty reasonable to forbid it.

There is a difference between forbidding it and looping for 10 minutes
in the kernel.

I have yet to understand WHY this loop is there at all. All I've seen
so far is that things might need to settle and that the indicator for
settlement is that the next expiry time of the per-cpu timer is set to
KTIME_MAX.

To be blunt, that just stinks. This is duct-tape engineering and not
even close to a reasonable design.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-30 22:02                             ` Thomas Gleixner
@ 2015-09-30 22:11                               ` Andy Lutomirski
  2015-10-01  8:12                                 ` Thomas Gleixner
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-09-30 22:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 30 Sep 2015, Chris Metcalf wrote:
>> So for now, if a task-isolation thread sets up a timer,
>> they're screwed: so, don't do that.  And it's really not part of
>> the typical programming model for these kinds of userspace
>> drivers anyway, so it's pretty reasonable to forbid it.
>
> There is a difference between forbidding it and looping for 10 minutes
> in the kernel.

I don't even like forbidding it.  Setting timers seems like an
entirely reasonable thing for even highly RT or isolated programs to
do, although admittedly they can do it on a non-RT thread and then
kick the RT thread when they're ready.

Heck, even without the TSC deadline timer, the kernel could, in
principle, support that use case by having whatever core is doing
housekeeping keep kicking the can forward until it's time to IPI the
isolated core because it needs to wake up.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-09-30 22:11                               ` Andy Lutomirski
@ 2015-10-01  8:12                                 ` Thomas Gleixner
  2015-10-01  9:08                                   ` Christoph Lameter
  2015-10-01 19:25                                   ` Chris Metcalf
  0 siblings, 2 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01  8:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel,
	H. Peter Anvin, X86 ML

On Wed, 30 Sep 2015, Andy Lutomirski wrote:
> On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 30 Sep 2015, Chris Metcalf wrote:
> >> So for now, if a task-isolation thread sets up a timer,
> >> they're screwed: so, don't do that.  And it's really not part of
> >> the typical programming model for these kinds of userspace
> >> drivers anyway, so it's pretty reasonable to forbid it.
> >
> > There is a difference between forbidding it and looping for 10 minutes
> > in the kernel.
> 
> I don't even like forbidding it.  Setting timers seems like an
> entirely reasonable thing for even highly RT or isolated programs to
> do, although admittedly they can do it on a non-RT thread and then
> kick the RT thread when they're ready.
> 
> Heck, even without the TSC deadline timer, the kernel could, in
> principle, support that use case by having whatever core is doing
> housekeeping keep kicking the can forward until it's time to IPI the
> isolated core because it needs to wake up.

That's simple. Just arm the timer on the other core. It's not rocket
science to do that.
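
(A kernel-side sketch of that idea; housekeeping_cpu and
my_callback are placeholders:)

	struct timer_list t;

	setup_timer(&t, my_callback, 0);
	t.expires = jiffies + 10 * 60 * HZ;	/* ten minutes out */
	add_timer_on(&t, housekeeping_cpu);	/* arm it on another core */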

But the whole problem with this isolation stuff is that it tries to
push half-baked duct-tape concepts into the tree.

That would be the same if we'd brute force merge the RT stuff and then
let everyone deal with the fallout. There is a really good reason, why
the remaining - hard to solve - pieces of RT are still out of tree.

And I really want to see a proper engineering for that isolation
stuff, which can be done with an out of tree patch set in the first
place. But sure, it's more convenient to push crap into mainline and
let everyone else deal with the fallout.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-10-01  8:12                                 ` Thomas Gleixner
@ 2015-10-01  9:08                                   ` Christoph Lameter
  2015-10-01 10:10                                     ` Thomas Gleixner
  2015-10-01 19:25                                   ` Chris Metcalf
  1 sibling, 1 reply; 340+ messages in thread
From: Christoph Lameter @ 2015-10-01  9:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin,
	X86 ML

On Thu, 1 Oct 2015, Thomas Gleixner wrote:

> And I really want to see a proper engineering for that isolation
> stuff, which can be done with an out of tree patch set in the first
> place. But sure, it's more convenient to push crap into mainline and
> let everyone else deal with the fallouts.

Yes, let's keep the half-baked stuff out, please. Firing a timer that signals
the app via a signal causes an event that is not unlike the OS noise that
we are trying to avoid. It's an asynchronous event that may interrupt at
random locations in the code. In that case I would say it's perfectly fine
to use the tick and other timer processing on the processor that requested
it. If you really want low latency and freedom from OS intervention, then
please do not set up timers. In fact any signal should bring on full OS
services on a processor.

AFAICT the threads of an app that requires low latency and minimal OS
noise would communicate via shared memory structures rather than via IPIs.
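
(A minimal userspace sketch of that kind of scheme, using C11
atomics; all names here are illustrative:)

	#include <stdatomic.h>

	/* Control thread publishes work; the isolated thread polls.
	 * No syscalls and no IPIs on this path. */
	struct mailbox {
		_Atomic unsigned long seq;
		unsigned long payload;
	};

	static unsigned long wait_for_work(struct mailbox *mb,
					   unsigned long last)
	{
		unsigned long seq;

		while ((seq = atomic_load_explicit(&mb->seq,
				memory_order_acquire)) == last)
			;	/* spin */
		return seq;
	}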

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-10-01  9:08                                   ` Christoph Lameter
@ 2015-10-01 10:10                                     ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01 10:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin,
	X86 ML

On Thu, 1 Oct 2015, Christoph Lameter wrote:
> On Thu, 1 Oct 2015, Thomas Gleixner wrote:
> 
> > And I really want to see a proper engineering for that isolation
> > stuff, which can be done with an out of tree patch set in the first
> > place. But sure, it's more convenient to push crap into mainline and
> > let everyone else deal with the fallouts.
> 
> Yes lets keep the half baked stuff out please. Firing a timer that signals
> the app via a signal causes an event that is not unlike the OS noise that
> we are trying to avoid. Its an asynchrononous event that may interrupt at
> random locations in the code. In that case I would say its perfectly fine
> to use the tick and other timer processing on the processor that requested
> it. If you really want low latency and be OS intervention free then please
> do not set up timers. In fact any signal should bring on full OS services
> on a processor.
 
Right. That's a recommendation to the application developer, which he
can follow or not. And he has to deal with the consequences if not.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-09-28 15:17               ` Chris Metcalf
  (?)
@ 2015-10-01 12:14               ` Frederic Weisbecker
  2015-10-01 12:18                 ` Thomas Gleixner
  2015-10-01 19:25                   ` Chris Metcalf
  -1 siblings, 2 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-01 12:14 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> new file mode 100644
> index 000000000000..fd04011b1c1e
> --- /dev/null
> +++ b/include/linux/isolation.h
> @@ -0,0 +1,24 @@
> +/*
> + * Task isolation related global functions
> + */
> +#ifndef _LINUX_ISOLATION_H
> +#define _LINUX_ISOLATION_H
> +
> +#include <linux/tick.h>
> +#include <linux/prctl.h>
> +
> +#ifdef CONFIG_TASK_ISOLATION
> +static inline bool task_isolation_enabled(void)
> +{
> +	return tick_nohz_full_cpu(smp_processor_id()) &&
> +		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);

Ok, I may be belaboring this a bit, but how about using the regular
existing task flags?  If needed later we can still introduce a new field
in struct task_struct.

> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..6ace866c69f6
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,77 @@
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation for task isolation.
> + *
> + *  Distributed under GPLv2.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/isolation.h>
> +#include "time/tick-sched.h"
> +
> +/*
> + * Rather than continuously polling for the next_event in the
> + * tick_cpu_device, architectures can provide a method to save power
> + * by sleeping until an interrupt arrives.
> + *
> + * Note that it must be guaranteed for a particular architecture
> + * that if next_event is not KTIME_MAX, then a timer interrupt will
> + * occur, otherwise the sleep may never awaken.
> + */
> +void __weak task_isolation_wait(void)
> +{
> +	cpu_relax();
> +}
> +
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In task_isolation mode we wait until no more interrupts are
> + * pending.  Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two task_isolation processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.  Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void task_isolation_enter(void)
> +{
> +	struct clock_event_device *dev =
> +		__this_cpu_read(tick_cpu_device.evtdev);
> +	struct task_struct *task = current;
> +	unsigned long start = jiffies;
> +	bool warned = false;
> +
> +	if (WARN_ON(irqs_disabled()))
> +		local_irq_enable();
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	/* Quieten the vmstat worker so it won't interrupt us. */
> +	quiet_vmstat();
> +
> +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {

You should add a function in tick-sched.c to get the next tick. This
is supposed to be a private field.
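
(Something like this hypothetical helper, say in
kernel/time/tick-sched.c; the name is made up:)

	/* Hypothetical accessor so callers need not poke at
	 * tick_cpu_device directly. */
	ktime_t tick_nohz_get_next_event(void)
	{
		struct clock_event_device *dev =
			__this_cpu_read(tick_cpu_device.evtdev);

		return dev->next_event;
	}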

> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
> +			pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
> +				task->comm, task->pid, smp_processor_id(),
> +				(jiffies - start) / HZ);
> +			warned = true;
> +		}
> +		cond_resched();
> +		if (test_thread_flag(TIF_SIGPENDING))
> +			break;

Why not use signal_pending()?
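
(For reference, signal_pending(current) is the canonical wrapper for
exactly this TIF_SIGPENDING test.)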

> +		task_isolation_wait();

I still think we could try a wait-wake standard scheme.

Thanks.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 12:14               ` Frederic Weisbecker
@ 2015-10-01 12:18                 ` Thomas Gleixner
  2015-10-01 12:23                   ` Frederic Weisbecker
  2015-10-01 17:02                     ` Chris Metcalf
  2015-10-01 19:25                   ` Chris Metcalf
  1 sibling, 2 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01 12:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
> > +
> > +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> 
> You should add a function in tick-sched.c to get the next tick. This
> is supposed to be a private field.

Just to make it clear. Neither the above nor a similar check in
tick-sched.c is going to happen.

This busy waiting is just horrible. Get your act together and solve
the problems at the root and do not inflict your quick and dirty
'solutions' on us.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 12:18                 ` Thomas Gleixner
@ 2015-10-01 12:23                   ` Frederic Weisbecker
  2015-10-01 12:31                     ` Thomas Gleixner
  2015-10-01 17:02                     ` Chris Metcalf
  1 sibling, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-01 12:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote:
> On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
> > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
> > > +
> > > +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> > 
> > You should add a function in tick-sched.c to get the next tick. This
> > is supposed to be a private field.
> 
> Just to make it clear. Neither the above nor a similar check in
> tick-sched.c is going to happen.
> 
> This busy waiting is just horrible. Get your act together and solve
> the problems at the root and do not inflict your quick and dirty
> 'solutions' on us.

That's why I proposed a wait-wake scheme instead with the tick stop
code. What's your opinion about such direction?

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 12:23                   ` Frederic Weisbecker
@ 2015-10-01 12:31                     ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01 12:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
> On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote:
> > On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
> > > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
> > > > +
> > > > +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> > > 
> > > You should add a function in tick-sched.c to get the next tick. This
> > > is supposed to be a private field.
> > 
> > Just to make it clear. Neither the above nor a similar check in
> > tick-sched.c is going to happen.
> > 
> > This busy waiting is just horrible. Get your act together and solve
> > the problems at the root and do not inflict your quick and dirty
> > 'solutions' on us.
> 
> That's why I proposed a wait-wake scheme instead with the tick stop
> code. What's your opinion about such direction?

Definitely more sensible than mindlessly busy looping.

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled
  2015-09-28 20:40               ` Andy Lutomirski
@ 2015-10-01 13:07                 ` Frederic Weisbecker
  2015-10-01 14:13                   ` Thomas Gleixner
  0 siblings, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-01 13:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel

On Mon, Sep 28, 2015 at 04:40:56PM -0400, Andy Lutomirski wrote:
> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > While the current fallback to 1-second tick is still helpful for
> > maintaining completely correct kernel semantics, processes using
> > prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
> > running completely tickless, so don't bound the time_delta for such
> > processes.  In addition, due to the way such processes quiesce by
> > waiting for the timer tick to stop prior to returning to userspace,
> > without this commit it won't be possible to use the task_isolation
> > mode at all.
> >
> > Removing the 1-second cap was previously discussed (see link
> > below) and Thomas Gleixner observed that vruntime, load balancing
> > data, load accounting, and other things might be impacted.
> > Frederic Weisbecker similarly observed that allowing the tick to
> > be indefinitely deferred just meant that no one would ever fix the
> > underlying bugs.  However it's at least true that the mode proposed
> > in this patch can only be enabled on a nohz_full core by a process
> > requesting task_isolation mode, which may limit how important it is
> > to maintain scheduler data correctly, for example.
> 
> What goes wrong when a task enables this?  Presumably either tasks
> that enable it experience problems or performance issues or it should
> always be enabled.

We need to make the scheduler resilient to a 0Hz tick. Currently it doesn't
even correctly support 1Hz, or any dynticks behaviour that isn't idle.

See update_cpu_load_active() for example.

> 
> One possible issue: __vdso_clock_gettime with any of the COARSE clocks
> as well as __vdso_time will break if the timekeeping code doesn't run
> somewhere with reasonable frequency on some core.  Hopefully this
> always works.
> 
> --Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled
  2015-10-01 13:07                 ` Frederic Weisbecker
@ 2015-10-01 14:13                   ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01 14:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-kernel

On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2015 at 04:40:56PM -0400, Andy Lutomirski wrote:
> > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> > > While the current fallback to 1-second tick is still helpful for
> > > maintaining completely correct kernel semantics, processes using
> > > prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on
> > > running completely tickless, so don't bound the time_delta for such
> > > processes.  In addition, due to the way such processes quiesce by
> > > waiting for the timer tick to stop prior to returning to userspace,
> > > without this commit it won't be possible to use the task_isolation
> > > mode at all.
> > >
> > > Removing the 1-second cap was previously discussed (see link
> > > below) and Thomas Gleixner observed that vruntime, load balancing
> > > data, load accounting, and other things might be impacted.
> > > Frederic Weisbecker similarly observed that allowing the tick to
> > > be indefinitely deferred just meant that no one would ever fix the
> > > underlying bugs.  However it's at least true that the mode proposed
> > > in this patch can only be enabled on a nohz_full core by a process
> > > requesting task_isolation mode, which may limit how important it is
> > > to maintain scheduler data correctly, for example.
> > 
> > What goes wrong when a task enables this?  Presumably either tasks
> > that enable it experience problems or performance issues or it should
> > always be enabled.
> 
> We need to make the scheduler resilient to 0Hz tick. Currently it doesn't
> even correctly support 1Hz or any dynticks behaviour that isn't idle.

Rik has started to work on this. No idea what the status of that is.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 12:18                 ` Thomas Gleixner
@ 2015-10-01 17:02                     ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-01 17:02 UTC (permalink / raw)
  To: Thomas Gleixner, Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, linux-doc, linux-api, linux-kernel

On 10/01/2015 08:18 AM, Thomas Gleixner wrote:
> On Thu, 1 Oct 2015, Frederic Weisbecker wrote:
>> On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
>>> +
>>> +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
>> You should add a function in tick-sched.c to get the next tick. This
>> is supposed to be a private field.
> Just to make it clear. Neither the above nor a similar check in
> tick-sched.c is going to happen.
>
> This busy waiting is just horrible. Get your act together and solve
> the problems at the root and do not inflict your quick and dirty
> 'solutions' on us.

Thomas,

You've raised a couple of different concerns and I want to
make sure I try to address them individually.

But first I want to address the question of the basic semantics
of the patch series.  I wrote up a description of why it's useful
in my email yesterday:

https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com

I haven't directly heard from you as to whether you buy the
basic premise of "hard isolation" in terms of protecting tasks
from all kernel interrupts while they execute in userspace.

I will add here that we've heard from multiple customers that
the equivalent Tilera functionality (Zero-Overhead Linux) was
the thing that brought them to buy our hardware rather than a
competitor's.  It's allowed them to write code that runs under
a full-featured Linux environment rather than doing the thing
that they otherwise would have been required to do, which is
to target a minimal bare-metal environment.  So as a feature,
if we can gain consensus on an implementation of it, I think it
will be an important step for that class of users, and potential
users, of Linux.

So I first want to address what is effectively the API concern that
you raised, namely that you're concerned that there is a wait
loop in the implementation.

The nice thing here is that there is in fact no requirement in
the API/ABI that we have a wait loop in the kernel at all.  Let's
say hypothetically that in the future we come up with a way to
guarantee, perhaps in some constrained kind of way, that you
can enter and exit the kernel and are guaranteed no further
timer interrupts, and we are so confident of this property that
we don't have to test for it programmatically on kernel exit.
(In fact, we would likely still use the task_isolation_debug boot
flag to generate a console warning if it ever did happen, but
whatever.)  At this point we could simply remove the timer
interrupt test loop in task_isolation_wait(); the applications would
be none the wiser, and the kernel would be that much cleaner.

However, today, and I think for the future, I see that loop as an
important backstop for whatever timer-elimination coding happens.
In general, the hard task-isolation requirement is something that
is of particular interest only to a subset of the kernel community.
As the kernel grows, adds features, re-implements functionality,
etc., it seems entirely likely that odd bits of deferred functionality
might be added in the same way that RCU, workqueues, etc., have
done in the past.  Or, applications might exercise unusual corners
of the kernel's semantics and come across an existing mechanism
that ends up enabling kernel ticks (maybe only one or two) before
returning to userspace.  The proposed busy-loop just prevents
that from damaging the application.  I'm skeptical that we can
prevent all such possible changes today and in the future, and I
think the loop is a simple way to avoid breaking applications
with interrupts; it only triggers for applications that have
requested it, on cores that have been configured to support it.

One additional insight that argues in favor of a busy-waiting solution
is that a task that requests task isolation is almost certainly alone
on the core.  If multiple tasks are in fact runnable on that core,
we have already abandoned the ability to use proper task isolation
since we will want to use timer ticks to run the scheduler for
pre-emption.  So we only busy wait when, in fact, no other useful
work is likely to get done on that core anyway.

The other questions you raise have to do with the mechanism for
ensuring that we wait until no timer interrupts are scheduled.

First is the question of how we detect that case.

As I said yesterday, the original approach I chose for the Tilera
implementation was one where we simply wait until the timer interrupt
is masked (as is done via the set_state_shutdown, set_state_oneshot,
and tick_resume callbacks in the tile clock_event_device).  When
unmasked, the timer down-counter just counts down to zero,
fires the interrupt, resets to its start value, and counts down again
until it fires again.  So we use masking of the interrupt to turn off
the timer tick.  Once we have done so, we are guaranteed no
further timer interrupts can occur.  I'm less familiar with the timer
subsystems of other architectures, but there are clearly per-platform
ways to make the same kinds of checks.  If this seems like a better
approach, I'm happy to work to add the necessary checks on
tile, arm64, and x86, though I'd certainly benefit from some
guidance on the timer implementation on the latter two platforms.
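
(One plausible shape for such a per-platform check, with the
generic next_event test as the default; entirely hypothetical:)

	/* Hypothetical arch hook; tile could instead test whether
	 * its timer interrupt is masked. */
	bool __weak arch_tick_quiesced(void)
	{
		struct clock_event_device *dev =
			__this_cpu_read(tick_cpu_device.evtdev);

		return READ_ONCE(dev->next_event.tv64) == KTIME_MAX;
	}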

One reason this might be necessary is if there is support on some
platforms for multiple timer interrupts, any of which can fire, not just
a single timer driven by the clock_event_device.  I'm not sure whether
this is ever in fact a problem, but if it is, it would
almost certainly require per-architecture code to determine whether
all the relevant timers were quiesced.

However, I'm not sure whether it's checking
the next_event in tick_cpu_device per se that you object to, or the
busy-waiting we do when it indicates a pending timer.  If you could
help clarify this piece, that would be good.

The last question is what to do when we detect that there is a timer
interrupt scheduled.  The current code spins, testing for resched
or signal events, and bails out back to the work-pending loop when
that happens.  As an extension, one can add support for spinning in
a lower-power state, as I did for tile, but this isn't required and frankly
isn't that important, since we don't anticipate spending much time in
the busy-loop state anyway.

The suggestion proposed by Frederic and echoed by you is a wait-wake
scheme.  I'm curious to hear a more fully fleshed-out version of it.
Clearly, we can test for pending timer interrupts and put the task to
sleep (pretty late in the return-to-userspace process, but maybe that's
OK).  The question is, how and when do we wake the task?  We could
add a hook to the platform timer shutdown code that would also wake
any process that was waiting for the no-timer case; that process would
then end up getting scheduled sometime later, and hopefully when it
came time for it to try exiting to userspace again, the timer would still
be shut down.
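
(A very rough sketch of that shape; every name here is
hypothetical:)

	static DECLARE_WAIT_QUEUE_HEAD(task_isolation_wq);

	/* On the exit-to-userspace path, instead of spinning: */
	wait_event_interruptible(task_isolation_wq,
				 tick_this_cpu_quiesced());

	/* ...and the platform timer shutdown path would add: */
	wake_up(&task_isolation_wq);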

This could be problematic if the scheduler code or some other part of
the kernel sets up the timer again before scheduling the waiting task
back in.  Arguably we can work to avoid this if it's really a problem.
And, there is the question of how to handle multiple timer interrupt
sources, since they would all have to quiesce before we would want to
wake the waiting process, but the "multiple timers" isn't handled by
the current code either, and it seems not to be a problem, so perhaps
that's OK.  Lastly, of course, is the question of what the kernel would
end up doing while waiting: and the answer is almost certainly that it
would sit in the cpu idle loop, waiting for the pending timer to fire and
wake the waiting task.  I'm not convinced that the extra complexity here
is worth the gain.

But I am open and willing to being convinced that I am wrong, and to
implement different approaches.  Let me know!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality
  2015-10-01  8:12                                 ` Thomas Gleixner
  2015-10-01  9:08                                   ` Christoph Lameter
@ 2015-10-01 19:25                                   ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin,
	X86 ML

On 10/01/2015 04:12 AM, Thomas Gleixner wrote:
> On Wed, 30 Sep 2015, Andy Lutomirski wrote:
>> On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Wed, 30 Sep 2015, Chris Metcalf wrote:
>>>> So for now, if a task-isolation thread sets up a timer,
>>>> they're screwed: so, don't do that.  And it's really not part of
>>>> the typical programming model for these kinds of userspace
>>>> drivers anyway, so it's pretty reasonable to forbid it.
>>> There is a difference between forbidding it and looping for 10 minutes
>>> in the kernel.
>> I don't even like forbidding it.  Setting timers seems like an
>> entirely reasonable thing for even highly RT or isolated programs to
>> do, although admittedly they can do it on a non-RT thread and then
>> kick the RT thread when they're ready.
>>
>> Heck, even without the TSC deadline timer, the kernel could, in
>> principle, support that use case by having whatever core is doing
>> housekeeping keep kicking the can forward until it's time to IPI the
>> isolated core because it needs to wake up.
> That's simple. Just arm the timer on the other core. It's not rocket
> science to do that.

This is a plausible direction to go for alarms requested when
task isolation is enabled.  But as Christoph said, it's almost
certainly a bad idea anyway.  Our customers are advised to
do this kind of stuff (what we call "control-plane" activity) in
a separate process on a housekeeping core, which communicates
with the nohz_full cores via shared memory.  On the nohz_full
side the threads just use polling and simple atomics for
locking.  (That's fun too, because you can't actually allow
those locks to get into a contended state, since that obliges the
unlocker to invoke futex_wake in the kernel, so we can't just
use pthread mutexes or other common implementations.)
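
(A minimal sketch of the kind of kernel-free lock that implies,
in C11 atomics; illustrative only:)

	#include <stdatomic.h>

	typedef struct { atomic_flag held; } busy_lock_t;

	static inline void busy_lock(busy_lock_t *l)
	{
		while (atomic_flag_test_and_set_explicit(&l->held,
					memory_order_acquire))
			;	/* pure spin: no futex_wake, ever */
	}

	static inline void busy_unlock(busy_lock_t *l)
	{
		atomic_flag_clear_explicit(&l->held,
					   memory_order_release);
	}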

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode
@ 2015-10-01 19:25                               ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 09/29/2015 02:00 PM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 09/29/2015 01:46 PM, Andy Lutomirski wrote:
>>> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com>
>>> wrote:
>>>> Well, the most interesting category is things that don't actually
>>>> trigger a signal (e.g. minor page fault) since those are things that
>>>> cause significant issues with task isolation processes
>>>> (kernel-induced jitter) but aren't otherwise user-visible,
>>>> much like an undiscovered syscall in a third-party library
>>>> can cause unexpected jitter.
>>> Would it make sense to exempt the exceptions that result in signals?
>>> After all, those are detectable even without your patches.  Going
>>> through all of the exception types:
>>>
>>> divide_error, overflow, invalid_op, coprocessor_segment_overrun,
>>> invalid_TSS, segment_not_present, stack_segment, alignment_check:
>>> these all send signals anyway.
>>>
>>> double_fault is fatal.
>>>
>>> bounds: MPX faults can be silently fixed up, and those will need
>>> notification.  (Or user code should know not to do that, since it
>>> requires an explicit opt in, and user code can flip it back off to get
>>> the signals.)
>>>
>>> general_protection: always signals except in vm86 mode.
>>>
>>> int3: silently fixed if uprobes are in use, but I don't think
>>> isolation cares about that.  Otherwise signals.
>>>
>>> debug: The perf hw_breakpoint can result in silent fixups, but those
>>> require explicit opt-in from the admin.  Otherwise, unless there's a
>>> bug or a debugger, the user will get a signal.  (As a practical
>>> matter, the only interesting case is the undocumented ICEBP
>>> instruction.)
>>>
>>> math_error, simd_coprocessor_error: Sends a signal.
>>>
>>> spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK.  We should
>>> just WARN if this hits.
>>>
>>> device_not_available: If you're using isolation without an FPU, you
>>> have bigger problems.
>>>
>>> page_fault: Needs notification.
>>>
>>> NMI, MCE: arguably these should *not* notify or at least not fatally.
>>>
>>> So maybe a better approach would be to explicitly notify for the
>>> relevant entries: IRQs, non-signalling page faults, and non-signalling
>>> MPX fixups.  Other arches would have their own lists, but they're
>>> probably also short except for emulated instructions.
>>
>> IRQs should get notified via the task_isolation_debug boot flag;
>> the intent is that they should never get delivered to nohz_full
>> cores anyway, so we produce a console backtrace if the boot
>> flag is enabled.  This isn't tied to having a task running with
>> TASK_ISOLATION enabled, since it just shouldn't ever happen.
> OK, I like that.  In that case, maybe NMI and MCE should be in a
> similar category.  (IOW if a non-fatal MCE happens and the debug param
> is set, we could warn, assuming that anyone is willing to write the
> code.  Doing printk from MCE is not entirely trivial, although it's
> less bad in recent kernels.)

For now I will stay away from tampering with the NMI/MCE
handlers, though if it turns out that it's the cause of mysterious
latencies in task-isolation applications in the future, it will
likely make sense to add some debugging there.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 12:14               ` Frederic Weisbecker
@ 2015-10-01 19:25                   ` Chris Metcalf
  2015-10-01 19:25                   ` Chris Metcalf
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 10/01/2015 08:14 AM, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote:
>> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
>> new file mode 100644
>> index 000000000000..fd04011b1c1e
>> --- /dev/null
>> +++ b/include/linux/isolation.h
>> @@ -0,0 +1,24 @@
>> +/*
>> + * Task isolation related global functions
>> + */
>> +#ifndef _LINUX_ISOLATION_H
>> +#define _LINUX_ISOLATION_H
>> +
>> +#include <linux/tick.h>
>> +#include <linux/prctl.h>
>> +
>> +#ifdef CONFIG_TASK_ISOLATION
>> +static inline bool task_isolation_enabled(void)
>> +{
>> +	return tick_nohz_full_cpu(smp_processor_id()) &&
>> +		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
> Ok, I may be belaboring this a bit, but how about using the regular
> existing task flags, and if needed later we can still introduce a new field
> in struct task_struct?

The problem is still that we have two basic bits ("enabled" and
"strict") plus eight bits of signal number to override SIGKILL.
So we end up with *something* extra in task_struct no matter what.
And, right now it's conveniently the same value as the bits
passed to prctl(), so we don't need to marshall and unmarshall
the prctl() get/set results.

If we could convince ourselves not to do the "settable signal"
stuff, I'd agree that using task flags makes sense, but I was
convinced for v2 of the patch series to add a settable signal,
and I suspect it still does make sense.
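
For reference, the bit layout in question is roughly the following
(an illustrative sketch; the authoritative definitions are in the
series' include/uapi/linux/prctl.h hunk):

    #define PR_TASK_ISOLATION_ENABLE        (1 << 0)  /* "enabled" bit */
    #define PR_TASK_ISOLATION_STRICT        (1 << 1)  /* "strict" bit */
    /* Eight bits of signal number, to override the default SIGKILL: */
    #define PR_TASK_ISOLATION_SET_SIG(sig)  (((sig) & 0xff) << 8)
    #define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0xff)

Since task->task_isolation_flags holds exactly these prctl() bits,
the prctl() get/set paths can copy the field back and forth as-is.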

>> +	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
> You should add a function in tick-sched.c to get the next tick. This
> is supposed to be a private field.

Yes.  Or probably better, a function that just says whether the
timer is quiesced.  Obviously I'll wait to hear what Thomas says
on this subject first, though.
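
Something along these lines, say (a sketch only; the name and
placement are illustrative pending that discussion):

    /* tick-sched.c helper, so callers need not poke at the private
     * clock_event_device fields: true if no tick timer event remains
     * armed on this cpu.
     */
    bool tick_nohz_timer_quiesced(void)
    {
            struct clock_event_device *dev =
                    __this_cpu_read(tick_cpu_device.evtdev);

            return READ_ONCE(dev->next_event.tv64) == KTIME_MAX;
    }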

>> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
>> +			pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
>> +				task->comm, task->pid, smp_processor_id(),
>> +				(jiffies - start) / HZ);
>> +			warned = true;
>> +		}
>> +		cond_resched();
>> +		if (test_thread_flag(TIF_SIGPENDING))
>> +			break;
> Why not use signal_pending()?

Makes sense, thanks.

> I still think we could try a standard wait-wake scheme.

I'm curious to hear what you make of my arguments in the
other thread on this subject!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
@ 2015-10-01 21:20                       ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-01 21:20 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, 1 Oct 2015, Chris Metcalf wrote:
> But first I want to address the question of the basic semantics
> of the patch series.  I wrote up a description of why it's useful
> in my email yesterday:
> 
> https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com
> 
> I haven't directly heard from you as to whether you buy the
> basic premise of "hard isolation" in terms of protecting tasks
> from all kernel interrupts while they execute in userspace.

Just for the record. The first serious initiative to solve that
problem started here in my own company when I guided Frederic through
the endeavour of figuring out what needs to be done to achieve
that. That was the assignment of his master's thesis, which I gave him.

So I'm very well aware why this is needed and what needs to be done.

I started this because I got tired of half-baked attempts to solve
the problem, which were even worse than what you are trying to do now.

> So I first want to address what is effectively the API concern that
> you raised, namely that you're concerned that there is a wait
> loop in the implementation.

That wait loop is just a place holder for the underlying more serious
concern I have with this whole approach. And I raised that concern
several times in the past and I'm happy to do so again.

The people working on this, especially you, are just dead set to
achieve a certain functionality by jamming half-baked mechanisms into
the kernel and especially into the low-level entry/exit code. And
that's something which really annoys me, simply because you refuse to
tackle the problems which have been identified as needing to be solved 5+
years ago when Frederic did his thesis.

Remote accounting:
==================

It's not an easy problem, but it's not rocket science either. It's
just quite some work.

I know that you just don't give a shit about it because your use case
does not care. But it's an essential part of the problem space. You
just work around it, by shutting down the tick completely and rely
on the fact that it does not explode in your face today.

If we accept your hackery, then who is going to fix it, when it
explodes in half a year from now?

Tick shut down:
===============

I still have to understand why the tick is needed at all.

There is exactly one reason why the tick must run if a cpu is in
full isolation mode:

  More than one SCHED_OTHER task is runnable on that cpu.

There is no other reason, period.

If there are requirements today to switch on the tick when a task
running in full isolation mode enters the kernel, then they need to be
fixed first.

And again you don't care, because for your particular use case it's
good enough to slap a busy wait loop into every archs low level exit
code and be done with it.

From your mail excusing that approach:

> The nice thing here is that there is in fact no requirement in
> the API/ABI that we have a wait loop in the kernel at all.  Let's
> say hypothetically that in the future we come up with a way to
> guarantee, perhaps in some constrained kind of way, that you
> can enter and exit the kernel and are guaranteed no further
> timer interrupts, ....

"Let's say hypothetically" tells it all. You are not even trying to
find a proper solution. You just try to get your particular interest
solved.

That's exactly the attitude which drives me nuts and that's the point
where I say no.

You can do all of that in an out of tree patch set as many other hard
to solve features have done for years. Yes, it's an annoying catchup
game, but it forces you to think harder, refactor code and do a lot of
extra work to finally get it merged.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
  2015-10-01 21:20                       ` Thomas Gleixner
@ 2015-10-02 17:15                         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-02 17:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 10/01/2015 05:20 PM, Thomas Gleixner wrote:
> On Thu, 1 Oct 2015, Chris Metcalf wrote:
>> But first I want to address the question of the basic semantics
>> of the patch series.  I wrote up a description of why it's useful
>> in my email yesterday:
>>
>> https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com
>>
>> I haven't directly heard from you as to whether you buy the
>> basic premise of "hard isolation" in terms of protecting tasks
>> from all kernel interrupts while they execute in userspace.
> Just for the record. The first serious initiative to solve that
> problem started here in my own company when I guided Frederic through
> the endeavour of figuring out what needs to be done to achieve
> that. That was the assignment of his master's thesis, which I gave him.

Thanks for that background.  I didn't know you had gotten
Frederic started down that path originally.

>> So I first want to address what is effectively the API concern that
>> you raised, namely that you're concerned that there is a wait
>> loop in the implementation.
> That wait loop is just a place holder for the underlying more serious
> concern I have with this whole approach. And I raised that concern
> several times in the past and I'm happy to do so again.
>
> The people working on this, especially you, are just dead set to
> achieve a certain functionality by jamming half-baked mechanisms into
> the kernel and especially into the low-level entry/exit code. And
> that's something which really annoys me, simply because you refuse to
> tackle the problems which have been identified as needing to be solved 5+
> years ago when Frederic did his thesis.

I think you raise a good point.  I still claim my arguments are
plausible, but you may be right that this is an instance where
forcing a different approach is better for the kernel community
as a whole.

Given that, what would you think of the following two changes
to my proposed patch series:

1. Rather than spinning in a busy loop if timers are pending,
we reschedule if more than one task is ready to run.  This
directly targets the "architected" problem with the scheduler
tick, rather than sweeping up the scheduler tick and any other
timers into the one catch-all of "any timer ready to fire".
(We can use sched_can_stop_tick() to check the case where
other tasks can preempt us.)  This would then provide part
of the semantics of the task-isolation flag.  The other part is
running whatever code can be run to avoid the various ways
tasks might get interrupted later (lru_add_drain(),
quiet_vmstat(), etc) that are not appropriate to run
unconditionally for tasks that aren't trying to be isolated.

2. Remove the tie between disabling the 1 Hz max deferment
and task isolation per se.  Instead add a boot flag (e.g.
"debug_1hz_tick") that lets us turn off the 1 Hz tick to make it
easy to experiment with both the negative effects of the
missing tick, as well as to try to learn in parallel what actual
timer interrupts are firing "on purpose" rather than just due
to the 1 Hz tick to try to eliminate them as well.

For #1, I'm not sure if it's better to hack up the scheduler's
pick_next_task callback methods to avoid task-isolation tasks
when other tasks are also available to run, or just to observe
that there are additional tasks ready to run during exit to
userspace, and yield the cpu to allow those other tasks to run.
The advantage of doing it at exit to userspace is that we can
easily yield in a loop and pay attention to whether we seem
not to be making forward progress with that task and generate
a suitable warning; it also keeps a lot of task-isolation stuff
out of the core scheduler code, which may be a plus.
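
Concretely, the exit-to-userspace variant might look something like
this (a rough sketch, not a finished patch; the stall warning mirrors
the one already in the series):

    void task_isolation_enter(void)
    {
            unsigned long start = jiffies;
            bool warned = false;

            lru_add_drain();        /* flush this cpu's lru pagevecs */
            quiet_vmstat();         /* fold stats, stop the vmstat work */

            /* Yield while the scheduler still wants the tick, i.e.
             * while other tasks remain runnable on this cpu.
             */
            while (!sched_can_stop_tick()) {
                    if (!warned && time_after(jiffies, start + 5 * HZ)) {
                            pr_warn("%s/%d: task_isolation task still contended on cpu %d after %ld seconds\n",
                                    current->comm, current->pid,
                                    smp_processor_id(),
                                    (jiffies - start) / HZ);
                            warned = true;
                    }
                    cond_resched(); /* let the other task(s) run */
                    if (signal_pending(current))
                            break;
            }
    }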

With these changes, and booting with the "debug_1hz_tick"
flag, I'm seeing a couple of timer ticks hit my task-isolation
task in the first 20 ms or so, and then it quiesces.  I will
plan to work on figuring out what is triggering those
interrupts and seeing how to fix them.  My hope is that in
parallel with that work, other folks can be working on how to
fix problems that occur more silently with the scheduler
tick max deferment disabled; I'm also happy to work on those
problems to the extent that I understand them (and I'm
always happy to learn more).

As part of the patch series I'd extend the proposed
task_isolation_debug flag to also track timer scheduling
events against task-isolation tasks that are ready to run
in userspace (no other runnable tasks).

What do you think of this approach?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 02/11] task_isolation: add initial support
@ 2015-10-02 19:02                           ` Thomas Gleixner
  0 siblings, 0 replies; 340+ messages in thread
From: Thomas Gleixner @ 2015-10-02 19:02 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

Chris,

On Fri, 2 Oct 2015, Chris Metcalf wrote:
> 1. Rather than spinning in a busy loop if timers are pending,
> we reschedule if more than one task is ready to run.  This
> directly targets the "architected" problem with the scheduler
> tick, rather than sweeping up the scheduler tick and any other
> timers into the one catch-all of "any timer ready to fire".
> (We can use sched_can_stop_tick() to check the case where
> other tasks can preempt us.)  This would then provide part
> of the semantics of the task-isolation flag.  The other part is
> running whatever code can be run to avoid the various ways
> tasks might get interrupted later (lru_add_drain(),
> quiet_vmstat(), etc) that are not appropriate to run
> unconditionally for tasks that aren't trying to be isolated.

Sounds like a plan
 
> 2. Remove the tie between disabling the 1 Hz max deferment
> and task isolation per se.  Instead add a boot flag (e.g.
> "debug_1hz_tick") that lets us turn off the 1 Hz tick to make it
> easy to experiment with both the negative effects of the
> missing tick, as well as to try to learn in parallel what actual
> timer interrupts are firing "on purpose" rather than just due
> to the 1 Hz tick to try to eliminate them as well.

I have no problem with a debug flag, which allows you to experiment,
though I'm not entirely sure whether we need to carry it in mainline
or just in an extra isolation git tree.

> For #1, I'm not sure if it's better to hack up the scheduler's
> pick_next_task callback methods to avoid task-isolation tasks
> when other tasks are also available to run, or just to observe
> that there are additional tasks ready to run during exit to
> userspace, and yield the cpu to allow those other tasks to run.
> The advantage of doing it at exit to userspace is that we can
> easily yield in a loop and pay attention to whether we seem
> not to be making forward progress with that task and generate
> a suitable warning; it also keeps a lot of task-isolation stuff
> out of the core scheduler code, which may be a plus.

You should discuss that with Peter Zijlstra. I see the plus of not
having it in the scheduler, but OTOH having it in the core code has its
advantages as well. Let's see how ugly it gets.
 
> With these changes, and booting with the "debug_1hz_tick"
> flag, I'm seeing a couple of timer ticks hit my task-isolation
> task in the first 20 ms or so, and then it quiesces.  I will
> plan to work on figuring out what is triggering those
> interrupts and seeing how to fix them.  My hope is that in
> parallel with that work, other folks can be working on how to
> fix problems that occur more silently with the scheduler
> tick max deferment disabled; I'm also happy to work on those
> problems to the extent that I understand them (and I'm
> always happy to learn more).

I like that approach :)
 
> As part of the patch series I'd extend the proposed
> task_isolation_debug flag to also track timer scheduling
> events against task-isolation tasks that are ready to run
> in userspace (no other runnable tasks).
>
> What do you think of this approach?

Makes sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-09-28 15:17             ` [PATCH v7 05/11] task_isolation: add debug boot flag Chris Metcalf
  2015-09-28 20:59               ` Andy Lutomirski
@ 2015-10-05 17:07               ` Luiz Capitulino
  2015-10-08  0:33                 ` Chris Metcalf
  1 sibling, 1 reply; 340+ messages in thread
From: Luiz Capitulino @ 2015-10-05 17:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel

On Mon, 28 Sep 2015 11:17:20 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> The new "task_isolation_debug" flag simplifies debugging
> of TASK_ISOLATION kernels when processes are running in
> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
> interrupts from the kernel, and if they do, when this boot flag is
> specified a kernel stack dump on the console is generated.
> 
> It's possible to use ftrace to simply detect whether a task_isolation
> core has unexpectedly entered the kernel.  But what this boot flag
> does is allow the kernel to provide better diagnostics, e.g. by
> reporting in the IPI-generating code what remote core and context
> is preparing to deliver an interrupt to a task_isolation core.
> 
> It may be worth considering other ways to generate useful debugging
> output rather than console spew, but for now that is simple and direct.

Honest question: does any of the task_isolation_debug() calls added
by this patch take care of the case where vmstat_shepherd() may
schedule vmstat_update() to run because a TASK_ISOLATION process is
changing memory stats?

If that's not taken care of yet, should we? I just don't know if we
should call task_isolation_exception() or task_isolation_debug().
In the case of the latter, wouldn't it be interesting to add it to
__queue_work() then?

> 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  Documentation/kernel-parameters.txt |  7 +++++++
>  include/linux/isolation.h           |  2 ++
>  kernel/irq_work.c                   |  5 ++++-
>  kernel/sched/core.c                 | 21 +++++++++++++++++++++
>  kernel/signal.c                     |  5 +++++
>  kernel/smp.c                        |  4 ++++
>  kernel/softirq.c                    |  7 +++++++
>  7 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 22a4b687ea5b..48ff15f3166f 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -3623,6 +3623,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			neutralize any effect of /proc/sys/kernel/sysrq.
>  			Useful for debugging.
>  
> +	task_isolation_debug	[KNL]
> +			In kernels built with CONFIG_TASK_ISOLATION and booted
> +			in nohz_full= mode, this setting will generate console
> +			backtraces when the kernel is about to interrupt a
> +			task that has requested PR_TASK_ISOLATION_ENABLE
> +			and is running on a nohz_full core.
> +
>  	tcpmhash_entries= [KNL,NET]
>  			Set the number of tcp_metrics_hash slots.
>  			Default value is 8192 or 16384 depending on total
> diff --git a/include/linux/isolation.h b/include/linux/isolation.h
> index 27a4469831c1..9f1747331a36 100644
> --- a/include/linux/isolation.h
> +++ b/include/linux/isolation.h
> @@ -18,11 +18,13 @@ extern void task_isolation_enter(void);
>  extern void task_isolation_syscall(int nr);
>  extern void task_isolation_exception(void);
>  extern void task_isolation_wait(void);
> +extern void task_isolation_debug(int cpu);
>  #else
>  static inline bool task_isolation_enabled(void) { return false; }
>  static inline void task_isolation_enter(void) { }
>  static inline void task_isolation_syscall(int nr) { }
>  static inline void task_isolation_exception(void) { }
> +static inline void task_isolation_debug(int cpu) { }
>  #endif
>  
>  static inline bool task_isolation_strict(void)
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index cbf9fb899d92..745c2ea6a4e4 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -17,6 +17,7 @@
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
>  #include <linux/smp.h>
> +#include <linux/isolation.h>
>  #include <asm/processor.h>
>  
>  
> @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
>  	if (!irq_work_claim(work))
>  		return false;
>  
> -	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
> +	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
> +		task_isolation_debug(cpu);
>  		arch_send_call_function_single_ipi(cpu);
> +	}
>  
>  	return true;
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3595403921bd..8ddabb0d7510 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -74,6 +74,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/context_tracking.h>
>  #include <linux/compiler.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/switch_to.h>
>  #include <asm/tlb.h>
> @@ -743,6 +744,26 @@ bool sched_can_stop_tick(void)
>  }
>  #endif /* CONFIG_NO_HZ_FULL */
>  
> +#ifdef CONFIG_TASK_ISOLATION
> +/* Enable debugging of any interrupts of task_isolation cores. */
> +static int task_isolation_debug_flag;
> +static int __init task_isolation_debug_func(char *str)
> +{
> +	task_isolation_debug_flag = true;
> +	return 1;
> +}
> +__setup("task_isolation_debug", task_isolation_debug_func);
> +
> +void task_isolation_debug(int cpu)
> +{
> +	if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
> +	    (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
> +		pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
> +		dump_stack();
> +	}
> +}
> +#endif
> +
>  void sched_avg_update(struct rq *rq)
>  {
>  	s64 period = sched_avg_period();
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0f6bbbe77b46..c6e09f0f7e24 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
>   */
>  void signal_wake_up_state(struct task_struct *t, unsigned int state)
>  {
> +#ifdef CONFIG_TASK_ISOLATION
> +	/* If the task is being killed, don't complain about task_isolation. */
> +	if (state & TASK_WAKEKILL)
> +		t->task_isolation_flags = 0;
> +#endif
>  	set_tsk_thread_flag(t, TIF_SIGPENDING);
>  	/*
>  	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 07854477c164..b0bddff2693d 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -14,6 +14,7 @@
>  #include <linux/smp.h>
>  #include <linux/cpu.h>
>  #include <linux/sched.h>
> +#include <linux/isolation.h>
>  
>  #include "smpboot.h"
>  
> @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	 * locking and barrier primitives. Generic code isn't really
>  	 * equipped to do the right thing...
>  	 */
> +	task_isolation_debug(cpu);
>  	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
>  		arch_send_call_function_single_ipi(cpu);
>  
> @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
>  	}
>  
>  	/* Send a message to all CPUs in the map */
> +	for_each_cpu(cpu, cfd->cpumask)
> +		task_isolation_debug(cpu);
>  	arch_send_call_function_ipi_mask(cfd->cpumask);
>  
>  	if (wait) {
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 479e4436f787..ed762fec7265 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -24,8 +24,10 @@
>  #include <linux/ftrace.h>
>  #include <linux/smp.h>
>  #include <linux/smpboot.h>
> +#include <linux/context_tracking.h>
>  #include <linux/tick.h>
>  #include <linux/irq.h>
> +#include <linux/isolation.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/irq.h>
> @@ -335,6 +337,11 @@ void irq_enter(void)
>  		_local_bh_enable();
>  	}
>  
> +	if (context_tracking_cpu_is_enabled() &&
> +	    context_tracking_in_user() &&
> +	    !in_interrupt())
> +		task_isolation_debug(smp_processor_id());
> +
>  	__irq_enter();
>  }
>  


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-10-05 17:07               ` Luiz Capitulino
@ 2015-10-08  0:33                 ` Chris Metcalf
  2015-10-08 20:28                   ` Luiz Capitulino
  0 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-10-08  0:33 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel

On 10/5/2015 1:07 PM, Luiz Capitulino wrote:
> On Mon, 28 Sep 2015 11:17:20 -0400
> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> The new "task_isolation_debug" flag simplifies debugging
>> of TASK_ISOLATION kernels when processes are running in
>> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
>> interrupts from the kernel, and if they do, when this boot flag is
>> specified a kernel stack dump on the console is generated.
>>
>> It's possible to use ftrace to simply detect whether a task_isolation
>> core has unexpectedly entered the kernel.  But what this boot flag
>> does is allow the kernel to provide better diagnostics, e.g. by
>> reporting in the IPI-generating code what remote core and context
>> is preparing to deliver an interrupt to a task_isolation core.
>>
>> It may be worth considering other ways to generate useful debugging
>> output rather than console spew, but for now that is simple and direct.
> Honest question: does any of the task_isolation_debug() calls added
> by this patch take care of the case where vmstat_shepherd() may
> schedule vmstat_update() to run because a TASK_ISOLATION process is
> changing memory stats?

The task_isolation_debug() calls don't "take care of" any cases - they are
really just there to generate console dumps when the kernel unexpectedly
interrupts a task_isolated task.

The idea with vmstat is that before a task_isolated task returns to
userspace, it quiesces the vmstat thread (does a final sweep to collect
the stats and turns off the scheduled work item).  As a result, the vmstat
shepherd won't run while the task is in userspace.  When and if it returns
to the kernel, it will again sweep up the stats before returning to userspace.

The usual shepherd mechanism on a housekeeping core might notice
that the task had entered the kernel and started changing stats, and
might then asynchronously restart the scheduled work, but it should be
quiesced again regardless on the way back out to userspace.

> If that's not taken care of yet, should we? I just don't know if we
> should call task_isolation_exception() or task_isolation_debug().

task_isolation_exception() is called when an exception (page fault or
similar) is generated synchronously by the running task and we want
to make sure to notify the task with a signal if it has set up STRICT mode
to indicate that it is not planning to enter the kernel.

> In the case of the latter, wouldn't it be interesting to add it to
> __queue_work() then?

Well, queuing remote work involves sending an IPI, and we already tag
both the SMP send side AND the client IRQ-handling side with a task_isolation_debug(),
so I expect in practice it would be detected.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v7 05/11] task_isolation: add debug boot flag
  2015-10-08  0:33                 ` Chris Metcalf
@ 2015-10-08 20:28                   ` Luiz Capitulino
  0 siblings, 0 replies; 340+ messages in thread
From: Luiz Capitulino @ 2015-10-08 20:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel

On Wed, 7 Oct 2015 20:33:56 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> On 10/5/2015 1:07 PM, Luiz Capitulino wrote:
> > On Mon, 28 Sep 2015 11:17:20 -0400
> > Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >> The new "task_isolation_debug" flag simplifies debugging
> >> of TASK_ISOLATION kernels when processes are running in
> >> PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
> >> interrupts from the kernel, and if they do, when this boot flag is
> >> specified a kernel stack dump on the console is generated.
> >>
> >> It's possible to use ftrace to simply detect whether a task_isolation
> >> core has unexpectedly entered the kernel.  But what this boot flag
> >> does is allow the kernel to provide better diagnostics, e.g. by
> >> reporting in the IPI-generating code what remote core and context
> >> is preparing to deliver an interrupt to a task_isolation core.
> >>
> >> It may be worth considering other ways to generate useful debugging
> >> output rather than console spew, but for now that is simple and direct.
> > Honest question: does any of the task_isolation_debug() calls added
> > by this patch take care of the case where vmstat_shepherd() may
> > schedule vmstat_update() to run because a TASK_ISOLATION process is
> > changing memory stats?
> 
> The task_isolation_debug() calls don't "take care of" any cases - they are
> really just there to generate console dumps when the kernel unexpectedly
> interrupts a task_isolated task.
> 
> The idea with vmstat is that before a task_isolated task returns to
> userspace, it quiesces the vmstat thread (does a final sweep to collect
> the stats and turns off the scheduled work item).  As a result, the vmstat
> shepherd won't run while the task is in userspace.  When and if it returns
> to the kernel, it will again sweep up the stats before returning to userspace.
> 
> The usual shepherd mechanism on a housekeeping core might notice
> that the task had entered the kernel and started changing stats, and
> might then asynchronously restart the scheduled work, but it should be
> quiesced again regardless on the way back out to userspace.

OK, I've missed the (obvious) fact that the process has to enter the
kernel to change stats. Thanks a lot for your explanation.

> > If that's not taken care of yet, should we? I just don't know if we
> > should call task_isolation_exception() or task_isolation_debug().
> 
> task_isolation_exception() is called when an exception (page fault or
> similar) is generated synchronously by the running task and we want
> to make sure to notify the task with a signal if it has set up STRICT mode
> to indicate that it is not planning to enter the kernel.
> 
> > In the case of the latter, wouldn't it be interesting to add it to
> > __queue_work() then?
> 
> Well, queuing remote work involves sending an IPI, and we already tag
> both the SMP send side AND the client IRQ-handling side with a task_isolation_debug(),
> so I expect in practice it would be detected.
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-09-28 15:17             ` Chris Metcalf
@ 2015-10-20 20:35               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:35 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

This email discusses in detail the changes for v8; please see
older versions of the cover letter for details about older versions.

v8: 
  The biggest difference in this version is, at Thomas Gleixner's
  suggestion, I removed the code that busy-waits until there are no
  scheduler-tick timer events queued.  Instead, we now test for
  higher-level properties when attempting to return to userspace.
  We check if the core believes it has stopped the scheduler tick
  (which handles checking for scheduler contention from other tasks,
  RCU usage of the cpu, posix cpu timers, perf, etc), and if it
  hasn't, we request that the current process be rescheduled.  In
  addition, we check if there are per-cpu lru pages to be drained, and
  we check if the vmstat worker has been quiesced.  The structure is
  pretty clean so we can add additional tests as needed there as well.

  One nice aspect of this revised structure is that if the user
  actually requests a signal from a timer (for example), we will
  now return to userspace and let the program run.  Of course it
  may get bombed with incremental timer ticks if the timer can't
  be programmed to the whole time interval in one step, but it still
  feels more correct this way than holding the process in the kernel
  until the user-requested timer expires.

  At Andy Lutomirski's suggestion, we separate out from the previous
  task_isolation_enter() a separate task_isolation_ready() test
  that can be done at the same time as we test the TIF_xxx flags,
  with interrupts disabled, so we can guarantee that the conditions
  we test for are still true when we return to userspace.

  To accomplish this we break out a new vmstat_idle() function
  that checks if the vmstat subsystem is quiesced on this core.
  Similarly, we factor out an lru_add_drain_needed() function from
  where it used to be in lru_add_drain_all().  Both of these
  "check" functions can now be called from task_isolation_ready()
  with interrupts disabled.

  Also at Andy's suggestion (and aligning with how I had done things
  previously in the Tilera private fork), the prctl() to enable task
  isolation will now fail with EINVAL if you attempt to enable
  task-isolation mode when your affinity does not lock you to a
  single core, or if that core is not a nohz_full core.

  We move the "strict" syscall test to just before SECCOMP instead
  of just after.  It's not particularly clear that one is better
  than the other abstractly, and on a couple of the supported
  platforms (x86, tile) it makes the code structure work out better
  because the user_enter() can be done at the same time as the
  test for strict mode.

  The integration with context_tracking has been completely dropped;
  discussing with Andy showed that there are only a few exception
  sites that need strict-mode checking (the typical one is
  page faults that don't raise signals) so just putting the checks
  in the relevant functions feels cleaner than trying to hijack
  the exception_enter/exception_exit paths, which are being
  removed for x86 in any case.

  The task_isolation_exception() hook now takes full printf
  format arguments, so that we can generate a much more useful
  report as to why we are killing the task.  As a result, we also
  remove the dump_stack() call, whose only utility was pointing
  the finger at which exception function had triggered.

  Rather than automatically disabling the 1 Hz maximum scheduler
  deferment for task-isolation tasks, we now require the user to
  specify a boot flag ("debug_1hz_tick") to do this.  The boot
  flag allows us to test the case where all the 1 Hz updating
  subsystems have been fixed before that work actually is finished.

  An architecture-specific fix is included in this patch series for
  the tile architecture; I will push it through the tile tree (along
  with the tile prepare_exit_to_usermode restructuring) if there are
  no concerns.  At issue is that we end up with one gratuitous timer
  tick when we are shutting down the timer; by setting up the
  set_state_oneshot_stopped function pointer callback for the tile
  tick timer we can avoid this problem.  (Thomas, I'd particularly
  appreciate your ack on this fix, which is number 13 out of 14 in
  this patch series.)
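
  (A hedged sketch of the idea, not the actual tile patch: the
  clockevents core now offers a set_state_oneshot_stopped hook, so a
  driver can truly mask its timer once the oneshot tick is stopped.
  The names below are illustrative.)

    static int tile_timer_stop(struct clock_event_device *evt)
    {
        /* Assumption: the tick is driven by a per-cpu timer irq. */
        disable_percpu_irq(evt->irq);
        return 0;
    }

    static struct clock_event_device tile_timer = {
        .name                      = "tile timer",
        .features                  = CLOCK_EVT_FEAT_ONESHOT,
        .set_state_oneshot_stopped = tile_timer_stop,
        /* ... */
    };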

  Rebased to v4.3-rc6 to pick up the fix for vmstat to properly
  use schedule_delayed_work_on(), since I was hitting a VM_BUG_ON
  without the fix (which I separately tracked down - oh well).

v7:
  switch to architecture hooks for task_isolation_enter
  add an RCU_LOCKDEP_WARN() (Andy Lutomirski)
  rebased to v4.3-rc1

v6:
  restructured to be a "task_isolation" mode not a "cpu_isolated"
  mode (Frederic)

v5:
  rebased on kernel v4.2-rc3
  converted to use CONFIG_CPU_ISOLATED and separate .c and .h files
  incorporates Christoph Lameter's quiet_vmstat() call

v4:
  rebased on kernel v4.2-rc1
  added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3:
  remove dependency on cpu_idle subsystem (Thomas Gleixner)
  use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
  use seconds for console messages instead of jiffies (Thomas Gleixner)
  updated commit description for patch 5/5

v2:
  rename "dataplane" to "cpu_isolated"
  drop ksoftirqd suppression changes (believed no longer needed)
  merge previous "QUIESCE" functionality into baseline functionality
  explicitly track syscalls and exceptions for "STRICT" functionality
  allow configuring a signal to be delivered for STRICT mode failures
  move debug tracking to irq_enter(), not irq_exit()

General summary:

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time:
for example, high-speed networking applications running userspace
drivers that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  The
kernel must be built with CONFIG_TASK_ISOLATION to take advantage of
this new mode.  A prctl() option (PR_SET_TASK_ISOLATION) is added to
control whether processes have requested these stricter semantics, and
within that prctl() option we provide a number of different bits for
more precise control.  Additionally, we add a new command-line boot
argument to facilitate debugging where unexpected interrupts are being
delivered from.
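
As a concrete illustration, here is a minimal userspace sketch of the
intended usage.  The prctl constants are the ones added by this series;
the cpu number and the nohz_full= setup are assumptions:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_TASK_ISOLATION
  # define PR_SET_TASK_ISOLATION    48
  # define PR_TASK_ISOLATION_ENABLE (1 << 0)
  #endif

  int main(void)
  {
      cpu_set_t set;

      /* Pin to a single core; assume cpu 3 is in nohz_full=. */
      CPU_ZERO(&set);
      CPU_SET(3, &set);
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return 1;
      }

      /* Fails with EINVAL unless pinned to one nohz_full core. */
      if (prctl(PR_SET_TASK_ISOLATION,
                PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
          perror("prctl(PR_SET_TASK_ISOLATION)");
          return 1;
      }

      /* The userspace-only main loop runs here, uninterrupted. */
      return 0;
  }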

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.  Although the current state of the kernel isn't quite
ready to run with absolutely no kernel interrupts, this
patch series provides a way to make dynamic tradeoffs between avoiding
kernel interrupts on the one hand, and making voluntary calls in and
out of the kernel more expensive, for tasks that want it.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: provide strict mode configurable signal
  task_isolation: add debug boot flag
  nohz_full: allow disabling the 1Hz minimum tick at boot
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: turn off timer tick for oneshot_stopped state
  arch/tile: enable task isolation functionality

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |   7 ++
 arch/arm64/include/asm/thread_info.h |  18 +++--
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  12 +++-
 arch/arm64/kernel/signal.c           |  35 +++++++---
 arch/arm64/mm/fault.c                |   4 ++
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 ++-
 arch/tile/kernel/intvec_32.S         |  46 ++++---------
 arch/tile/kernel/intvec_64.S         |  49 +++++---------
 arch/tile/kernel/process.c           |  83 ++++++++++++-----------
 arch/tile/kernel/ptrace.c            |   6 +-
 arch/tile/kernel/single_step.c       |   5 ++
 arch/tile/kernel/time.c              |   1 +
 arch/tile/kernel/unaligned.c         |   3 +
 arch/tile/mm/fault.c                 |   3 +
 arch/tile/mm/homecache.c             |   5 +-
 arch/x86/entry/common.c              |  10 ++-
 arch/x86/kernel/traps.c              |   2 +
 arch/x86/mm/fault.c                  |   2 +
 include/linux/isolation.h            |  61 +++++++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/swap.h                 |   1 +
 include/linux/vmstat.h               |   4 ++
 include/uapi/linux/prctl.h           |   8 +++
 init/Kconfig                         |  20 ++++++
 kernel/Makefile                      |   1 +
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 127 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  37 ++++++++++
 kernel/signal.c                      |  13 ++++
 kernel/smp.c                         |   4 ++
 kernel/softirq.c                     |   7 ++
 kernel/sys.c                         |   9 +++
 mm/swap.c                            |  13 ++--
 mm/vmstat.c                          |  24 +++++++
 36 files changed, 507 insertions(+), 137 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH v8 01/14] vmstat: provide a function to quiet down the diff processing
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:35               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:35 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 82e7db7f7100..c013b8d8e434 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..a9c446353c7e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 02/14] vmstat: add vmstat_idle function
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36               ` Chris Metcalf
  2015-10-20 20:45                 ` Christoph Lameter
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running on this
core and that the vmstat diffs don't require an update.  It is
called from the task-isolation code to decide whether we need to
do any work to quiet vmstat.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c013b8d8e434..34e3b768e432 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -212,6 +212,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -274,6 +275,7 @@ static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a9c446353c7e..05fa1f0eefc8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1431,6 +1431,16 @@ static bool need_update(int cpu)
 	return false;
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	int cpu = smp_processor_id();
+	return cpumask_test_cpu(cpu, cpu_stat_off) && !need_update(cpu);
+}
+
 
 /*
  * Shepherd worker thread that checks the
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 03/14] lru_add_drain_all: factor out lru_add_drain_needed
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was being done in the loop in lru_add_drain_all(),
but having it be callable for a particular cpu is helpful for the
task-isolation patches.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 13 +++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..66719610c9f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,6 +305,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..e21f3357cedd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,6 +854,14 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -880,10 +888,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 04/14] task_isolation: add initial support
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flags,
to the value passed by prctl().  When the _ENABLE bit is set for
a task, and it is returning to userspace on a nohz_full core,
it calls the new task_isolation_ready() / task_isolation_enter()
routines to take additional actions to help the task avoid being
interrupted in the future.

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify
the scheduler to attempt to schedule a different task.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.
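
As a rough sketch of the integration described above (illustrative
only: the flag names echo x86's, and the actual arch patches later
in the series differ in detail):

  static void prepare_exit_to_usermode(struct pt_regs *regs)
  {
      u32 cached_flags;

      while (true) {
          /* With irqs disabled, flags and tick state can't change. */
          cached_flags = READ_ONCE(current_thread_info()->flags);
          if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
              task_isolation_ready())
              break;  /* safe to return to userspace */

          local_irq_enable();

          if (cached_flags & _TIF_NEED_RESCHED)
              schedule();
          /* ... handle the other TIF work items here ... */

          /* Quiesce what we can: lru drain, vmstat worker, etc. */
          task_isolation_enter();

          local_irq_disable();
      }
  }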

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 38 ++++++++++++++++++++++
 include/linux/sched.h      |  3 ++
 include/uapi/linux/prctl.h |  5 +++
 init/Kconfig               | 20 ++++++++++++
 kernel/Makefile            |  1 +
 kernel/isolation.c         | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |  9 ++++++
 7 files changed, 154 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..4bef90024924
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,38 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+extern int task_isolation_set(unsigned int flags);
+static inline bool task_isolation_enabled(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+	return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+	if (task_isolation_enabled())
+		_task_isolation_enter();
+}
+
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501b41af..7a50f6904675 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1812,6 +1812,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f767bf0..4ff7f052059a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks that are running hard real-time
+	 code, such as a 10 Gbit network driver in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..9a73235db0bb
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,78 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single nohz_full core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+	    !tick_nohz_full_cpu(smp_processor_id()))
+		return -EINVAL;
+
+	current->task_isolation_flags = flags;
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  To handle
+ * the periodic scheduler tick, we test to make sure that the tick is
+ * stopped, and if it isn't yet, we request a reschedule so that if
+ * another task needs to run to completion first, it can do so.
+ * Similarly, if any other subsystems require quiescing, we will need
+ * to do that before we return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	/* If we need to drain the LRU cache, we're not ready. */
+	if (lru_add_drain_needed(smp_processor_id()))
+		return false;
+
+	/* If vmstats need updating, we're not ready. */
+	if (!vmstat_idle())
+		return false;
+
+	/* If the tick is running, request rescheduling; we're not ready. */
+	if (!tick_nohz_tick_stopped()) {
+		set_tsk_need_resched(current);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index fa2f2f671a5c..f1b1d333f74d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that the bit can be cleared again later, and we ignore
exit/exit_group so the task can exit without a pointless signal
killing it on the way out.
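
In userspace terms, a hedged sketch using the constants from this
series:

  /* After this call, any unexpected kernel entry delivers SIGKILL;
   * prctl() and exit()/exit_group() stay usable so the task can
   * still leave strict mode or exit cleanly. */
  prctl(PR_SET_TASK_ISOLATION,
        PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT, 0, 0, 0);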

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 21 +++++++++++++++++++++
 include/uapi/linux/prctl.h |  1 +
 kernel/isolation.c         | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 64 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 4bef90024924..dc14057a359c 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -29,10 +29,31 @@ static inline void task_isolation_enter(void)
 		_task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern bool task_isolation_exception(const char *fmt, ...);
+
+static inline bool task_isolation_strict(void)
+{
+	return (tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->task_isolation_flags &
+		 (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+		(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT));
+}
+
+#define task_isolation_check_syscall(nr) \
+	(task_isolation_strict() && \
+	 task_isolation_syscall(nr))
+
+#define task_isolation_check_exception(fmt, ...) \
+	(task_isolation_strict() && \
+	 task_isolation_exception(fmt, ## __VA_ARGS__))
+
 #else
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline bool task_isolation_check_exception(const char *fmt, ...) { return false; }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..2b8038b0d1e1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,6 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 9a73235db0bb..30db40098a35 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 /*
@@ -76,3 +77,44 @@ void _task_isolation_enter(void)
 	/* Quieten the vmstat worker so it won't interrupt us. */
 	quiet_vmstat();
 }
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+bool task_isolation_exception(const char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+		current->comm, current->pid, buf);
+	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
+	send_sig(SIGKILL, current, 1);
+
+	return true;
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return false;
+	}
+
+	return task_isolation_exception("syscall %d", syscall);
+}
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Allow userspace to override the default SIGKILL delivered
when a task_isolation process in STRICT mode does a syscall
or otherwise synchronously enters the kernel.
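
As a hypothetical usage sketch (not part of the patch), a task could
request SIGUSR1 rather than the default SIGKILL via the new
PR_TASK_ISOLATION_SET_SIG() macro; the handler name here is made up:

    /* Report strict-mode violations instead of dying. */
    signal(SIGUSR1, isolation_violation_handler);
    prctl(PR_SET_TASK_ISOLATION,
          PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
          PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

Note that the kernel clears PR_TASK_ISOLATION_ENABLE before delivering
the signal, so the handler already runs with isolation disabled.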

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h | 2 ++
 kernel/isolation.c         | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2b8038b0d1e1..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,5 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
 # define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 30db40098a35..0fa13b081bb4 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -84,8 +84,10 @@ void _task_isolation_enter(void)
  */
 bool task_isolation_exception(const char *fmt, ...)
 {
+	siginfo_t info = {};
 	va_list args;
 	char buf[100];
+	int sig;
 
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
@@ -97,7 +99,12 @@ bool task_isolation_exception(const char *fmt, ...)
 	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
 		current->comm, current->pid, buf);
 	current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE;
-	send_sig(SIGKILL, current, 1);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+	info.si_signo = sig;
+	send_sig_info(sig, &info, current);
 
 	return true;
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 07/14] task_isolation: add debug boot flag
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (6 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should receive no
interrupts from the kernel; when this boot flag is specified, any
interrupt that does arrive generates a kernel stack dump on the console.

ftrace can already detect that a task_isolation core has unexpectedly
entered the kernel.  What this boot flag adds is better diagnostics
from the kernel itself, e.g. reporting, in the IPI-generating code,
which remote core and context is preparing to deliver an interrupt
to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
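
As a usage example (CPU list hypothetical), a test kernel might boot
with:

    nohz_full=1-7 task_isolation_debug

so that any interrupt aimed at an isolated task on CPUs 1-7 produces
a console backtrace identifying where it was generated.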

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  7 +++++++
 include/linux/isolation.h           |  2 ++
 kernel/irq_work.c                   |  5 ++++-
 kernel/sched/core.c                 | 21 +++++++++++++++++++++
 kernel/signal.c                     |  5 +++++
 kernel/smp.c                        |  4 ++++
 kernel/softirq.c                    |  7 +++++++
 7 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 22a4b687ea5b..48ff15f3166f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3623,6 +3623,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION and booted
+			in nohz_full= mode, this setting will generate console
+			backtraces when the kernel is about to interrupt a
+			task that has requested PR_TASK_ISOLATION_ENABLE
+			and is running on a nohz_full core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index dc14057a359c..ad94d1168c31 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -31,6 +31,7 @@ static inline void task_isolation_enter(void)
 
 extern bool task_isolation_syscall(int nr);
 extern bool task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_debug(int cpu);
 
 static inline bool task_isolation_strict(void)
 {
@@ -54,6 +55,7 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 static inline bool task_isolation_check_syscall(int nr) { return false; }
 static inline bool task_isolation_check_exception(const char *fmt, ...) { return false; }
+static inline void task_isolation_debug(int cpu) { }
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index cbf9fb899d92..745c2ea6a4e4 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10a8faa1b0d4..b79f8e0aeffb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -746,6 +747,26 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug(int cpu)
+{
+	if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) &&
+	    (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) {
+		pr_err("Interrupt detected for task_isolation cpu %d\n", cpu);
+		dump_stack();
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 0f6bbbe77b46..c6e09f0f7e24 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		t->task_isolation_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..b0bddff2693d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
+	task_isolation_debug(cpu);
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 		arch_send_call_function_single_ipi(cpu);
 
@@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	for_each_cpu(cpu, cfd->cpumask)
+		task_isolation_debug(cpu);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..ed762fec7265 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,8 +24,10 @@
 #include <linux/ftrace.h>
 #include <linux/smp.h>
 #include <linux/smpboot.h>
+#include <linux/context_tracking.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -335,6 +337,11 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	if (context_tracking_cpu_is_enabled() &&
+	    context_tracking_in_user() &&
+	    !in_interrupt())
+		task_isolation_debug(smp_processor_id());
+
 	__irq_enter();
 }
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (7 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  2015-10-20 21:03                 ` Frederic Weisbecker
  -1 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

While the current fallback to 1-second tick is still required for
a number of kernel accounting tasks (e.g. vruntime, load balancing
data, and load accounting), it's useful to be able to disable it
for testing purposes.  Paul McKenney observed that a mode in which
the 1Hz fallback timer is removed creates an environment where new
code that relies on that tick gets punished, and such assumptions
are not forgiven silently.

This option also allows easy testing of nohz_full and task-isolation
modes to determine what functionality needs to be implemented,
and what possibly-spurious timer interrupts are scheduled when
the basic 1Hz tick has been turned off.
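
For example (CPU list hypothetical), booting with:

    nohz_full=1-7 debug_1hz_tick

removes the residual 1Hz tick on the nohz_full cores, so any timer
interrupt that still fires there points at real remaining work.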

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/sched/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b79f8e0aeffb..634d5c2ab08a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2849,6 +2849,19 @@ void scheduler_tick(void)
 }
 
 #ifdef CONFIG_NO_HZ_FULL
+/*
+ * Allow a boot-time option to debug running
+ * without the 1Hz minimum tick on nohz_full cores.
+ */
+static bool debug_1hz_tick;
+
+static __init int set_debug_1hz_tick(char *arg)
+{
+	debug_1hz_tick = true;
+	return 1;
+}
+__setup("debug_1hz_tick", set_debug_1hz_tick);
+
 /**
  * scheduler_tick_max_deferment
  *
@@ -2867,6 +2880,9 @@ u64 scheduler_tick_max_deferment(void)
 	struct rq *rq = this_rq();
 	unsigned long next, now = READ_ONCE(jiffies);
 
+	if (debug_1hz_tick)
+		return KTIME_MAX;
+
 	next = rq->last_sched_tick + HZ;
 
 	if (time_before_eq(next, now))
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 09/14] arch/x86: enable task isolation functionality
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (8 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/x86/entry/common.c | 10 +++++++++-
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 80dcc9261ca3..13426c0656b4 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -81,6 +82,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
+		if (task_isolation_check_syscall(regs->orig_ax)) {
+			regs->orig_ax = -1;
+			return 0;
+		}
 		work &= ~_TIF_NOHZ;
 	}
 #endif
@@ -234,7 +239,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
 
 		if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
 				      _TIF_UPROBE | _TIF_NEED_RESCHED |
-				      _TIF_USER_RETURN_NOTIFY)))
+				      _TIF_USER_RETURN_NOTIFY)) &&
+		    task_isolation_ready())
 			break;
 
 		/* We have work to do. */
@@ -258,6 +264,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 	}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 346eec73f7db..1ed4d8a52d23 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_check_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_check_exception */
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_check_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 10/14] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
arm64 do_notify_resume() is called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/entry.S  |  6 +++---
 arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 4306c937b1ff..6fcbf8ea307b 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -628,9 +628,8 @@ work_pending:
 	mov	x0, sp				// 'regs'
 	tst	x2, #PSR_MODE_MASK		// user mode regs?
 	b.ne	no_work_pending			// returning to kernel
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
+	bl	prepare_exit_to_usermode
+	b	no_user_work_pending
 work_resched:
 	bl	schedule
 
@@ -642,6 +641,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+no_user_work_pending:
 	enable_step_tsk x1, x2
 no_work_pending:
 	kernel_exit 0
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..fde59c1139a9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
+					 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	do {
+		local_irq_enable();
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+		if (thread_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+
+		local_irq_disable();
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		thread_flags = READ_ONCE(current_thread_info()->flags) &
+			_TIF_WORK_MASK;
 
+	} while (thread_flags);
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 11/14] arch/arm64: enable task isolation functionality
  2015-10-20 20:35               ` Chris Metcalf
@ 2015-10-20 20:36                 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

We need to call task_isolation_enter() from prepare_exit_to_usermode(),
so that we can both ensure it runs last before returning to
userspace and re-run signal handling, etc., if something occurs
while task_isolation_enter() has interrupts enabled.  To do this
we add _TIF_NOHZ to _TIF_WORK_MASK when CONFIG_TASK_ISOLATION is
enabled, which brings us into prepare_exit_to_usermode() on every
return to userspace.  But we
don't put _TIF_NOHZ in the flags that we use to loop back and
recheck, since we don't need to loop back only because the flag
is set.  Instead we unconditionally call task_isolation_enter()
at the end of the loop if any other work is done.

To make the assembly code continue to be as optimized as before,
we renumber the _TIF flags so that both _TIF_WORK_MASK and
_TIF_SYSCALL_WORK still have contiguous runs of bits in the
immediate operand for the "and" instruction, as required by the
ARM64 ISA.  Since TIF_NOHZ is in both masks, it must be the
middle bit in the contiguous run that starts with the
_TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits.
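
Concretely, the renumbering below yields (values derived from the new
bit assignments, shown for illustration):

    _TIF_WORK_LOOP_MASK = 0x00f     /* bits 0-3 */
    _TIF_WORK_MASK      = 0x01f     /* bits 0-4, with TASK_ISOLATION */
    _TIF_SYSCALL_WORK   = 0x1f0     /* bits 4-8 */

so each mask is a single contiguous run of bits encodable as an ARM64
logical immediate, with _TIF_NOHZ (bit 4) shared between the two.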

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------
 arch/arm64/kernel/ptrace.c           | 12 +++++++++---
 arch/arm64/kernel/signal.c           |  7 +++++--
 arch/arm64/mm/fault.c                |  4 ++++
 4 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d18a42a..4c36c4ee3528 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -101,11 +101,11 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
-#define TIF_NOHZ		7
-#define TIF_SYSCALL_TRACE	8
-#define TIF_SYSCALL_AUDIT	9
-#define TIF_SYSCALL_TRACEPOINT	10
-#define TIF_SECCOMP		11
+#define TIF_NOHZ		4
+#define TIF_SYSCALL_TRACE	5
+#define TIF_SYSCALL_AUDIT	6
+#define TIF_SYSCALL_TRACEPOINT	7
+#define TIF_SECCOMP		8
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -124,9 +124,15 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
-#define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
 
+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK		(_TIF_WORK_LOOP_MASK | _TIF_NOHZ)
+#else
+# define _TIF_WORK_MASK		_TIF_WORK_LOOP_MASK
+#endif
+
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
 				 _TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1971f491bb90..69ed3ba81650 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	/* Do the secure computing check first; failures should be fast. */
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno))
+		return -1;
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index fde59c1139a9..641c828653c7 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -419,10 +420,12 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
 		if (thread_flags & _TIF_FOREIGN_FPSTATE)
 			fpsimd_restore_current_state();
 
+		task_isolation_enter();
+
 		local_irq_disable();
 
 		thread_flags = READ_ONCE(current_thread_info()->flags) &
-			_TIF_WORK_MASK;
+			_TIF_WORK_LOOP_MASK;
 
-	} while (thread_flags);
+	} while (thread_flags || !task_isolation_ready());
 }
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 9fadf6d7039b..a726f9f3ef3c 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -466,6 +467,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
 	const struct fault_info *inf = fault_info + (esr & 63);
 	struct siginfo info;
 
+	if (user_mode(regs))
+		task_isolation_check_exception("%s at %#lx", inf->name, addr);
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 12/14] arch/tile: adopt prepare_exit_to_usermode() model from x86
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (11 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
tile do_work_pending() was called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

This change exposes a pre-existing bug on the older tilepro platform;
the singlestep processing is done last, but on tilepro (unlike tilegx)
we enable interrupts while doing that processing, so we could in
theory miss a signal or other asynchronous event.  A future change
could fix this by breaking the singlestep work into a "prepare"
step done in the main loop, and a "trigger" step done after exiting
the loop.  Since this change is intended as purely a restructuring
change, we call out the bug explicitly now, but don't yet fix it.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/include/asm/processor.h   |  2 +-
 arch/tile/include/asm/thread_info.h |  8 +++-
 arch/tile/kernel/intvec_32.S        | 46 +++++++--------------
 arch/tile/kernel/intvec_64.S        | 49 +++++++----------------
 arch/tile/kernel/process.c          | 79 +++++++++++++++++++------------------
 5 files changed, 77 insertions(+), 107 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index 139dfdee0134..0684e88aacd8 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task)
 	/* Nothing for now */
 }
 
-extern int do_work_pending(struct pt_regs *regs, u32 flags);
+extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags);
 
 
 /*
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index dc1fb28d9636..4b7cef9e94e0 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -140,10 +140,14 @@ extern void _cpu_idle(void);
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
 
+/* Work to do as we loop to exit to user space. */
+#define _TIF_WORK_MASK \
+	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
-	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ)
+	(_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ)
 
 /* Work to do at syscall entry. */
 #define _TIF_SYSCALL_ENTRY_WORK \
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index fbbe2ea882ea..33d48812872a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 * Get base of stack in r32.
-	 */
-	{
-	 GET_THREAD_INFO(r32)
-	 movei  r33, 0
-	}
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	GET_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, lo16(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, lo16(_TIF_ALLWORK_MASK)
 	}
 	{
-	 lw     r29, r29
-	 auli   r1, r1, ha16(_TIF_ALLWORK_MASK)
+	 lw     r22, r22
+	 auli   r20, r20, ha16(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	bzt     r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling, notify-resume, or single-step.  Call out to C
-	 * code to figure out exactly what we need to do for each flag bit,
-	 * then if necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnz    r33, 1f
+	 bzt    r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnz     r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 58964d209d4d..a41c994ce237 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 */
-	{
-	 movei  r33, 0
-	 move   r32, sp
-	}
-
-	/* Get base of stack in r32. */
-	EXTRACT_THREAD_INFO(r32)
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	move    r21, sp
+	EXTRACT_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, hw1_last(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, hw1_last(_TIF_ALLWORK_MASK)
 	}
 	{
-	 ld     r29, r29
-	 shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK)
+	 ld     r22, r22
+	 shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	beqzt   r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling or notify-resume.  Call out to C code to figure out
-	 * exactly what we need to do for each flag bit, then if
-	 * necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnez   r33, 1f
+	 beqzt  r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnez    r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 7d5769310bef..b5f30d376ce1 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
 
 /*
  * This routine is called on return from interrupt if any of the
- * TIF_WORK_MASK flags are set in thread_info->flags.  It is
- * entered with interrupts disabled so we don't miss an event
- * that modified the thread_info flags.  If any flag is set, we
- * handle it and return, and the calling assembly code will
- * re-disable interrupts, reload the thread flags, and call back
- * if more flags need to be handled.
- *
- * We return whether we need to check the thread_info flags again
- * or not.  Note that we don't clear TIF_SINGLESTEP here, so it's
- * important that it be tested last, and then claim that we don't
- * need to recheck the flags.
+ * TIF_ALLWORK_MASK flags are set in thread_info->flags.  It is
+ * entered with interrupts disabled so we don't miss an event that
+ * modified the thread_info flags.  We loop until all the tested flags
+ * are clear.  Note that the function is called on certain conditions
+ * that are not listed in the loop condition here (e.g. SINGLESTEP)
+ * which guarantees we will do those things once, and redo them if any
+ * of the other work items is re-done, but won't continue looping if
+ * all the other work is done.
  */
-int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
+void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 {
-	/* If we enter in kernel mode, do nothing and exit the caller loop. */
-	if (!user_mode(regs))
-		return 0;
+	if (WARN_ON(!user_mode(regs)))
+		return;
 
-	user_exit();
+	do {
+		local_irq_enable();
 
-	/* Enable interrupts; they are disabled again on return to caller. */
-	local_irq_enable();
+		if (thread_info_flags & _TIF_NEED_RESCHED)
+			schedule();
 
-	if (thread_info_flags & _TIF_NEED_RESCHED) {
-		schedule();
-		return 1;
-	}
 #if CHIP_HAS_TILE_DMA()
-	if (thread_info_flags & _TIF_ASYNC_TLB) {
-		do_async_page_fault(regs);
-		return 1;
-	}
+		if (thread_info_flags & _TIF_ASYNC_TLB)
+			do_async_page_fault(regs);
 #endif
-	if (thread_info_flags & _TIF_SIGPENDING) {
-		do_signal(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_SINGLESTEP)
+
+		if (thread_info_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		local_irq_disable();
+		thread_info_flags = READ_ONCE(current_thread_info()->flags);
+
+	} while (thread_info_flags & _TIF_WORK_MASK);
+
+	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
+#ifndef __tilegx__
+		/*
+		 * FIXME: on tilepro, since we enable interrupts in
+		 * this routine, it's possible that we miss a signal
+		 * or other asynchronous event.
+		 */
+		local_irq_disable();
+#endif
+	}
 
 	user_enter();
-
-	return 0;
 }
 
 unsigned long get_wchan(struct task_struct *p)
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 13/14] arch/tile: turn off timer tick for oneshot_stopped state
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (12 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().  That
function's call to tick_program_event() detects that we are trying to
set the expiration to KTIME_MAX and calls clockevents_switch_state()
to set the state to ONESHOT_STOPPED, and returns.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for the
tile clock_event_device, so that code returns -ENOSYS, and we end up
not setting the state, and more importantly, we don't actually turn
off the tile hardware timer.  As a result, the timer tick we were
waiting for before is still queued, and fires shortly afterwards,
only to discover there was nothing for it to do, at which point
it quiesces.

The fix is to provide that function pointer for tile, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially disastrous
kernel timer interruption that could cause packets to be dropped.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/time.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH v8 14/14] arch/tile: enable task isolation functionality
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (13 preceding siblings ...)
  (?)
@ 2015-10-20 20:36               ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_check_exception() in places
where exceptions may not generate signals to the application.

In addition, we add an overriding task_isolation_wait() call
that runs a nap instruction while waiting for an interrupt, to
make the task_isolation_enter() loop run in a lower-power state.
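
A minimal sketch of what that override might look like (an assumption
for illustration: it presumes the generic code exposes an overridable
default task_isolation_wait(), and reuses the existing tile idle
helper _cpu_idle()):

    #ifdef CONFIG_TASK_ISOLATION
    /* Wait in a lower-power state until the next interrupt. */
    void task_isolation_wait(void)
    {
            _cpu_idle();    /* tile "nap" until an interrupt arrives */
    }
    #endif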

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c     | 6 +++++-
 arch/tile/kernel/ptrace.c      | 6 +++++-
 arch/tile/kernel/single_step.c | 5 +++++
 arch/tile/kernel/unaligned.c   | 3 +++
 arch/tile/mm/fault.c           | 3 +++
 arch/tile/mm/homecache.c       | 5 ++++-
 6 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..832febfd65df 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
-	} while (thread_info_flags & _TIF_WORK_MASK);
+	} while ((thread_info_flags & _TIF_WORK_MASK) ||
+		 !task_isolation_ready());
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index bdc126faf741..63acf7b4655f 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -259,8 +260,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR]))
+			return -1;
+	}
 
 	if (secure_computing() == -1)
 		return -1;
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 53f7b9def07b..4cba9f4a1915 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
@@ -321,6 +322,8 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -770,6 +773,8 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index d075f92ccee0..dbb9c1144236 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -26,6 +26,7 @@
 #include <linux/compat.h>
 #include <linux/prctl.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1547,6 +1548,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		goto done;
 	}
 
+	task_isolation_check_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..53514ca54143 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -846,6 +847,8 @@ void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
 	enum ctx_state prev_state = exception_enter();
+	task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)",
+				       fault_num, regs->pc, address);
 	__do_page_fault(regs, fault_num, address, write);
 	exception_exit(prev_state);
 }
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..a79325113105 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
-	for_each_cpu(cpu, &mask)
+	for_each_cpu(cpu, &mask) {
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+		task_isolation_debug(cpu);
+	}
 }
 
 /*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 02/14] vmstat: add vmstat_idle function
  2015-10-20 20:36               ` [PATCH v8 02/14] vmstat: add vmstat_idle function Chris Metcalf
@ 2015-10-20 20:45                 ` Christoph Lameter
  0 siblings, 0 replies; 340+ messages in thread
From: Christoph Lameter @ 2015-10-20 20:45 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-kernel

On Tue, 20 Oct 2015, Chris Metcalf wrote:

> This function checks to see if a vmstat worker is not running,
> and the vmstat diffs don't require an update.  The function is
> called from the task-isolation code to see if we need to
> actually do some work to quiet vmstat.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2015-10-20 20:36                 ` Chris Metcalf
  (?)
@ 2015-10-20 20:56                 ` Andy Lutomirski
  2015-10-20 21:20                     ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-10-20 20:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> +/*
> + * In task isolation mode we try to return to userspace only after
> + * attempting to make sure we won't be interrupted again.  To handle
> + * the periodic scheduler tick, we test to make sure that the tick is
> + * stopped, and if it isn't yet, we request a reschedule so that if
> + * another task needs to run to completion first, it can do so.
> + * Similarly, if any other subsystems require quiescing, we will need
> + * to do that before we return to userspace.
> + */
> +bool _task_isolation_ready(void)
> +{
> +       WARN_ON_ONCE(!irqs_disabled());
> +
> +       /* If we need to drain the LRU cache, we're not ready. */
> +       if (lru_add_drain_needed(smp_processor_id()))
> +               return false;
> +
> +       /* If vmstats need updating, we're not ready. */
> +       if (!vmstat_idle())
> +               return false;
> +
> +       /* If the tick is running, request rescheduling; we're not ready. */
> +       if (!tick_nohz_tick_stopped()) {
> +               set_tsk_need_resched(current);
> +               return false;
> +       }
> +
> +       return true;
> +}

I still don't get why this is a loop.

I would argue that this should simply drain the LRU, quiet vmstat, and
return.  If the tick isn't stopped, then there's a reason why it's not
stopped (which may involve having SCHED_OTHER tasks around, in which
case user code shouldn't do that or there should simply be a
requirement that isolation requires a real-time scheduler class).

BTW, should isolation just be a scheduler class (SCHED_ISOLATED)?

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 20:36               ` [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot Chris Metcalf
@ 2015-10-20 21:03                 ` Frederic Weisbecker
  2015-10-20 21:18                   ` Chris Metcalf
                                     ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-20 21:03 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Tue, Oct 20, 2015 at 04:36:06PM -0400, Chris Metcalf wrote:
> While the current fallback to 1-second tick is still required for
> a number of kernel accounting tasks (e.g. vruntime, load balancing
> data, and load accounting), it's useful to be able to disable it
> for testing purposes.  Paul McKenney observed that if we provide
> a mode where the 1Hz fallback timer is removed, this will provide
> an environment where new code that relies on that tick will get
> punished, and we won't forgive such assumptions silently.
> 
> This option also allows easy testing of nohz_full and task-isolation
> modes to determine what functionality needs to be implemented,
> and what possibly-spurious timer interrupts are scheduled when
> the basic 1Hz tick has been turned off.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>

There have been proposals to disable/tune the 1 Hz tick via debugfs which
I Nacked because once you give such an opportunity to the users, they
will use that hack and never fix the real underlying issue.

For the same reasons, I'm sorry but I have to Nack this proposal as well.

If this is for development or testing purpose, scheduler_max_tick_deferment() is
easily commented out.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 21:03                 ` Frederic Weisbecker
@ 2015-10-20 21:18                   ` Chris Metcalf
  2015-10-21  0:59                     ` Steven Rostedt
  2015-10-21  6:56                   ` Gilad Ben Yossef
  2015-10-21 14:28                   ` Christoph Lameter
  2 siblings, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 21:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On 10/20/2015 05:03 PM, Frederic Weisbecker wrote:
> On Tue, Oct 20, 2015 at 04:36:06PM -0400, Chris Metcalf wrote:
>> While the current fallback to 1-second tick is still required for
>> a number of kernel accounting tasks (e.g. vruntime, load balancing
>> data, and load accounting), it's useful to be able to disable it
>> for testing purposes.  Paul McKenney observed that if we provide
>> a mode where the 1Hz fallback timer is removed, this will provide
>> an environment where new code that relies on that tick will get
>> punished, and we won't forgive such assumptions silently.
>>
>> This option also allows easy testing of nohz_full and task-isolation
>> modes to determine what functionality needs to be implemented,
>> and what possibly-spurious timer interrupts are scheduled when
>> the basic 1Hz tick has been turned off.
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> There have been proposals to disable/tune the 1 Hz tick via debugfs which
> I Nacked because once you give such an opportunity to the users, they
> will use that hack and never fix the real underlying issue.
>
> For the same reasons, I'm sorry but I have to Nack this proposal as well.
>
> If this is for development or testing purpose, scheduler_max_tick_deferment() is
> easily commented out.

Fair enough and certainly your prerogative, so don't hesitate to
say "no" to the following argument.  :-)

I would tend to differentiate a debugfs proposal from a boot flag
proposal: a boot flag is a more hardcore thing to change, and it's
not like application developers will come along and explain that
you have to boot with different flags to run their app - whereas
a debugfs setting is something an app can much more easily sneak
in a tweak to.

So perhaps a boot flag is an acceptable compromise between
"nothing" and a debugfs tweak?  It certainly does make it easier
to hack on the task-isolation code, and likely other things where
people are trying out fixes to subsystems where they are attempting
to remove the reliance on the tick.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
@ 2015-10-20 21:20                     ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-20 21:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 10/20/2015 04:56 PM, Andy Lutomirski wrote:
> On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> +/*
>> + * In task isolation mode we try to return to userspace only after
>> + * attempting to make sure we won't be interrupted again.  To handle
>> + * the periodic scheduler tick, we test to make sure that the tick is
>> + * stopped, and if it isn't yet, we request a reschedule so that if
>> + * another task needs to run to completion first, it can do so.
>> + * Similarly, if any other subsystems require quiescing, we will need
>> + * to do that before we return to userspace.
>> + */
>> +bool _task_isolation_ready(void)
>> +{
>> +       WARN_ON_ONCE(!irqs_disabled());
>> +
>> +       /* If we need to drain the LRU cache, we're not ready. */
>> +       if (lru_add_drain_needed(smp_processor_id()))
>> +               return false;
>> +
>> +       /* If vmstats need updating, we're not ready. */
>> +       if (!vmstat_idle())
>> +               return false;
>> +
>> +       /* If the tick is running, request rescheduling; we're not ready. */
>> +       if (!tick_nohz_tick_stopped()) {
>> +               set_tsk_need_resched(current);
>> +               return false;
>> +       }
>> +
>> +       return true;
>> +}
> I still don't get why this is a loop.

You mean, why is this code called from prepare_exit_to_usermode()
in the loop, instead of after the loop?  It's because the actual functions
that clean up the LRU, vmstat worker, etc., may need interrupts enabled,
may reschedule internally, etc.  (refresh_cpu_vm_stats() calls
cond_resched(), for example.)  Even more importantly, we rely on
rescheduling to take care of the fact that the scheduler tick may still
be running, and therefore loop back to the schedule() call that's run
when TIF_NEED_RESCHED gets set.

And so, since interrupts and scheduling can happen, we need to be
run in a loop to retest, just like the existing tests for signal dispatch,
need_resched, etc.
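
For reference, the resulting loop shape (condensed from the arch/tile
patch earlier in the series; the schedule() and signal steps here
stand in for the usual exit-path work and are not the literal patch
text) is:

	do {
		local_irq_enable();

		if (thread_info_flags & _TIF_NEED_RESCHED)
			schedule();	/* also how a still-running tick is handled */

		/* ... signal delivery, tracehook_notify_resume(), etc ... */

		task_isolation_enter();	/* may enable IRQs, drain LRU, quiet vmstat */

		local_irq_disable();
		thread_info_flags = READ_ONCE(current_thread_info()->flags);
	} while ((thread_info_flags & _TIF_WORK_MASK) ||
		 !task_isolation_ready());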


> I would argue that this should simply drain the LRU, quiet vmstat, and
> return.  If the tick isn't stopped, then there's a reason why it's not
> stopped (which may involve having SCHED_OTHER tasks around, in which
> case user code shouldn't do that or there should simply be a
> requirement that isolation requires a real-time scheduler class).

Sure, there's a reason the tick isn't stopped, but if it's not
yet stopped, we need to schedule out and wait for
that to happen.  A real-time scheduler class won't completely
take care of this as you still may have issues like RCU needing the
cpu or any of the other cases in can_stop_full_tick().
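
For context, can_stop_full_tick() checks roughly the following (from
memory; the exact set of checks varies by kernel version):

	static bool can_stop_full_tick(void)
	{
		WARN_ON_ONCE(!irqs_disabled());

		if (!sched_can_stop_tick())	/* e.g. >1 runnable task */
			return false;
		if (!posix_cpu_timers_can_stop_tick(current))
			return false;
		if (!perf_event_can_stop_tick())	/* perf needs the tick */
			return false;
		if (!sched_clock_stable())	/* unstable sched clock */
			return false;
		return true;
	}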

> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)?

So a scheduler class is an interesting idea certainly, although not
one I know immediately how to implement.  I'm not sure whether
it makes sense to require a user be root or have a suitable rtprio
rlimit, but perhaps so.  The nice thing about the current patch
series is that you can affinitize yourself to a nohz_full core and
declare that you want to run task-isolated, and none of that
requires root, nor is there really a reason it should.  I guess you
could make SCHED_ISOLATED like SCHED_BATCH and perhaps
therefore allow non-root users to switch to it?

In any case it would have to be true that we would still be doing
all the other tests we do now, even if we could count on the
scheduler to take care of only trying to run it when there were no
other runnable processes.  So it would certainly add complexity.
I'm not sure how to evaluate the utility.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
@ 2015-10-20 21:26                       ` Andy Lutomirski
  0 siblings, 0 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-10-20 21:26 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Oct 20, 2015 at 2:20 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 10/20/2015 04:56 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>>>
>>> +/*
>>> + * In task isolation mode we try to return to userspace only after
>>> + * attempting to make sure we won't be interrupted again.  To handle
>>> + * the periodic scheduler tick, we test to make sure that the tick is
>>> + * stopped, and if it isn't yet, we request a reschedule so that if
>>> + * another task needs to run to completion first, it can do so.
>>> + * Similarly, if any other subsystems require quiescing, we will need
>>> + * to do that before we return to userspace.
>>> + */
>>> +bool _task_isolation_ready(void)
>>> +{
>>> +       WARN_ON_ONCE(!irqs_disabled());
>>> +
>>> +       /* If we need to drain the LRU cache, we're not ready. */
>>> +       if (lru_add_drain_needed(smp_processor_id()))
>>> +               return false;
>>> +
>>> +       /* If vmstats need updating, we're not ready. */
>>> +       if (!vmstat_idle())
>>> +               return false;
>>> +
>>> +       /* If the tick is running, request rescheduling; we're not ready.
>>> */
>>> +       if (!tick_nohz_tick_stopped()) {
>>> +               set_tsk_need_resched(current);
>>> +               return false;
>>> +       }
>>> +
>>> +       return true;
>>> +}
>>
>> I still don't get why this is a loop.
>
>
> You mean, why is this code called from prepare_exit_to_userspace()
> in the loop, instead of after the loop?  It's because the actual functions
> that clean up the LRU, vmstat worker, etc., may need interrupts enabled,
> may reschedule internally, etc.  (refresh_cpu_vm_stats() calls
> cond_resched(), for example.)

Yuck.  I guess that's a reasonable argument, although it could also be fixed.

>  Even more importantly, we rely on
> rescheduling to take care of the fact that the scheduler tick may still
> be running, and therefore loop back to the schedule() call that's run
> when TIF_NEED_RESCHED gets set.

This just seems like a mis-design.  We don't know why the scheduler
tick is on, so we're just going to reschedule until the problem goes
away?

>
>> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)?
>
>
> So a scheduler class is an interesting idea certainly, although not
> one I know immediately how to implement.  I'm not sure whether
> it makes sense to require a user be root or have a suitable rtprio
> rlimit, but perhaps so.  The nice thing about the current patch
> series is that you can affinitize yourself to a nohz_full core and
> declare that you want to run task-isolated, and none of that
> requires root nor really is there a reason it should.

Your patches more or less implement "don't run me unless I'm
isolated".  A scheduler class would be more like "isolate me (and
maybe make me super high priority so it actually happens)".

I'm not a scheduler person, so I don't know.  But "don't run me unless
I'm isolated" seems like a design that will, at best, only ever work
by dumb luck.  You have to disable migration, avoid other runnable
tasks, hope that the kernel keeps working the way it did when you
wrote the patch, hope you continue to get lucky enough that you ever
get to user mode in the first place, etc.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
@ 2015-10-21  0:29                         ` Steven Rostedt
  0 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-10-21  0:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, 20 Oct 2015 14:26:34 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> I'm not a scheduler person, so I don't know.  But "don't run me unless
> I'm isolated" seems like a design that will, at best, only ever work
> by dumb luck.  You have to disable migration, avoid other runnable
> tasks, hope that the kernel keeps working the way it did when you
> wrote the patch, hope you continue to get lucky enough that you ever
> get to user mode in the first place, etc.


Since it only makes sense to run one isolated task per cpu (not more
than one on the same CPU), I wonder if we should add a new interface
for this, that would force everything else off the CPU that it
requests. That is, you bind a task to a CPU, and then change it to
SCHED_ISOLATED (or what not), and the kernel will force all other tasks
off that CPU. Well, we would still have kernel threads, but that's a
different matter.

Also, doesn't RCU need to have a few ticks go by before it can safely
disable itself from userspace? I recall something like that. Paul?

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-20 20:36                 ` Chris Metcalf
@ 2015-10-21  0:56                   ` Steven Rostedt
  -1 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-10-21  0:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, 20 Oct 2015 16:36:04 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> Allow userspace to override the default SIGKILL delivered
> when a task_isolation process in STRICT mode does a syscall
> or otherwise synchronously enters the kernel.
> 

Is this really a good idea? This means that there's no way to terminate
a task in this mode, even if it goes astray.

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 21:18                   ` Chris Metcalf
@ 2015-10-21  0:59                     ` Steven Rostedt
  0 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-10-21  0:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Frederic Weisbecker, Gilad Ben Yossef, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel

On Tue, 20 Oct 2015 17:18:13 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> So perhaps a boot flag is an acceptable compromise between
> "nothing" and a debugfs tweak?  It certainly does make it easier
> to hack on the task-isolation code, and likely other things where
> people are trying out fixes to subsystems where they are attempting
> to remove the reliance on the tick.
> 

Just change the name to:

this_will_crash_your_kernel_and_kill_your_kittens_debug_1hz_tick

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
@ 2015-10-21  1:30                     ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-21  1:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 10/20/2015 8:56 PM, Steven Rostedt wrote:
> On Tue, 20 Oct 2015 16:36:04 -0400
> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> Allow userspace to override the default SIGKILL delivered
>> when a task_isolation process in STRICT mode does a syscall
>> or otherwise synchronously enters the kernel.
>>
> Is this really a good idea? This means that there's no way to terminate
> a task in this mode, even if it goes astray.

It doesn't map SIGKILL to some other signal unconditionally.  It just allows
the "hey, you broke the STRICT contract and entered the kernel" signal
to be something besides the default SIGKILL.
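
In other words, usage is along these lines (a sketch; the flag
spellings here approximate the series' uapi and may differ in detail):

	/* Request strict isolation, but deliver SIGUSR1 rather than
	 * the default SIGKILL when the contract is broken. */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);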

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-21  1:30                     ` Chris Metcalf
@ 2015-10-21  1:41                       ` Steven Rostedt
  -1 siblings, 0 replies; 340+ messages in thread
From: Steven Rostedt @ 2015-10-21  1:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, 20 Oct 2015 21:30:36 -0400
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> On 10/20/2015 8:56 PM, Steven Rostedt wrote:
> > On Tue, 20 Oct 2015 16:36:04 -0400
> > Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >> Allow userspace to override the default SIGKILL delivered
> >> when a task_isolation process in STRICT mode does a syscall
> >> or otherwise synchronously enters the kernel.
> >>
> > Is this really a good idea? This means that there's no way to terminate
> > a task in this mode, even if it goes astray.
> 
> It doesn't map SIGKILL to some other signal unconditionally.  It just allows
> the "hey, you broke the STRICT contract and entered the kernel" signal
> to be something besides the default SIGKILL.
> 

Ah, I misread the change log. Now looking at the actual code, it makes
sense. Sorry for the noise ;-)

-- Steve

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-21  1:30                     ` Chris Metcalf
  (?)
  (?)
@ 2015-10-21  1:42                     ` Andy Lutomirski
  2015-10-21  6:41                         ` Gilad Ben Yossef
  -1 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-10-21  1:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 10/20/2015 8:56 PM, Steven Rostedt wrote:
>>
>> On Tue, 20 Oct 2015 16:36:04 -0400
>> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>
>>> Allow userspace to override the default SIGKILL delivered
>>> when a task_isolation process in STRICT mode does a syscall
>>> or otherwise synchronously enters the kernel.
>>>
>> Is this really a good idea? This means that there's no way to terminate
>> a task in this mode, even if it goes astray.
>
>
> It doesn't map SIGKILL to some other signal unconditionally.  It just allows
> the "hey, you broke the STRICT contract and entered the kernel" signal
> to be something besides the default SIGKILL.
>

...which has the odd side effect that sending a non-fatal signal from
another process will cause the strict process to enter the kernel and
receive an extra signal.

I still dislike this thing.  It seems like a debugging feature being
implemented using signals instead of existing APIs.  I *still* don't
see why perf can't be used to accomplish your goal.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
@ 2015-10-21  6:41                         ` Gilad Ben Yossef
  0 siblings, 0 replies; 340+ messages in thread
From: Gilad Ben Yossef @ 2015-10-21  6:41 UTC (permalink / raw)
  To: Andy Lutomirski, Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel


> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Wednesday, October 21, 2015 4:43 AM
> To: Chris Metcalf
> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
> configurable signal
> 
> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com>
> wrote:
> > On 10/20/2015 8:56 PM, Steven Rostedt wrote:
> >>
> >> On Tue, 20 Oct 2015 16:36:04 -0400
> >> Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >>
> >>> Allow userspace to override the default SIGKILL delivered
> >>> when a task_isolation process in STRICT mode does a syscall
> >>> or otherwise synchronously enters the kernel.
> >>>
<snip>
> >
> > It doesn't map SIGKILL to some other signal unconditionally.  It just allows
> > the "hey, you broke the STRICT contract and entered the kernel" signal
> > to be something besides the default SIGKILL.
> >
> 

<snip>
> 
> I still dislike this thing.  It seems like a debugging feature being
> implemented using signals instead of existing APIs.  I *still* don't
> see why perf can't be used to accomplish your goal.
> 

It is not (just) a debugging feature.  There are workloads where not performing an action is much preferred to being late.

Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter.
The code runs in a tight event loop, reading an MMIO register with location data, performing a calculation, and in response writing an
MMIO register that triggers the alpha emitter.  As a safety measure, each trigger is for a specific very short time frame - the alpha emitter
auto-stops.

The code has a strict assumption that no more than X cycles pass between reading the value and the response, and the system is built in
such a way that as long as the code has mastery of the CPU the assumption holds true.  If something breaks this assumption (an unplanned
context switch into the kernel), what you want to do is just stop in place
rather than fire the alpha emitter X nanoseconds too late.
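
Schematically (illustrative only; the device registers and
compute_dose() are invented for this example):

	for (;;) {
		u32 pos  = readl(mmio + POSITION_REG);	/* location data */
		u32 dose = compute_dose(pos);	/* must finish within X cycles */
		writel(dose, mmio + TRIGGER_REG);	/* emitter auto-stops */
	}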

This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once.

For code where isolation is important, the correctness of a calculation depends on timing.  It's like expecting the kernel to
kill a task that reads from an unmapped virtual address rather than have it return garbage data.  With an isolated task, the right data acted on
later than you expect is garbage just the same.

I hope this sheds some light on the issue.

Thanks,
Gilad

^ permalink raw reply	[flat|nested] 340+ messages in thread

* RE: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 21:03                 ` Frederic Weisbecker
  2015-10-20 21:18                   ` Chris Metcalf
@ 2015-10-21  6:56                   ` Gilad Ben Yossef
  2015-10-21 14:28                   ` Christoph Lameter
  2 siblings, 0 replies; 340+ messages in thread
From: Gilad Ben Yossef @ 2015-10-21  6:56 UTC (permalink / raw)
  To: Frederic Weisbecker, Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, linux-doc, linux-kernel


> From: Frederic Weisbecker [mailto:fweisbec@gmail.com]
> > This option also allows easy testing of nohz_full and task-isolation
> > modes to determine what functionality needs to be implemented,
> > and what possibly-spurious timer interrupts are scheduled when
> > the basic 1Hz tick has been turned off.
> >
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> 
> There have been proposals to disable/tune the 1 Hz tick via debugfs which
> I Nacked because once you give such an opportunity to the users, they
> will use that hack and never fix the real underlying issue.
> 
> For the same reasons, I'm sorry but I have to Nack this proposal as well.
> 
> If this is for development or testing purpose,
> scheduler_max_tick_deferment() is
> easily commented out.

The problem with the latter is that it is much easier to get back to one of the poor^H^H^H^H brave souls who are
willing to risk their kittens testing this stuff for us and say: "can you please boot without this boot option and let
me know if the behavior you were complaining about still happens?" rather than "can you please go to this
and that line in the source file, un-comment it, re-compile, and see if it still happens?"

I hope this makes more sense.

Thinking about it, it's probably a good idea to taint the kernel when this option is set as well.

Thanks,
Gilad

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-20 20:35               ` Chris Metcalf
                                 ` (14 preceding siblings ...)
  (?)
@ 2015-10-21 12:39               ` Peter Zijlstra
  2015-10-22 20:31                   ` Chris Metcalf
  -1 siblings, 1 reply; 340+ messages in thread
From: Peter Zijlstra @ 2015-10-21 12:39 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel



Can you *please* start a new thread with each posting?

This is absolutely unmanageable.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-20 21:03                 ` Frederic Weisbecker
  2015-10-20 21:18                   ` Chris Metcalf
  2015-10-21  6:56                   ` Gilad Ben Yossef
@ 2015-10-21 14:28                   ` Christoph Lameter
  2015-10-21 15:35                     ` Frederic Weisbecker
  2 siblings, 1 reply; 340+ messages in thread
From: Christoph Lameter @ 2015-10-21 14:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-doc, linux-kernel

On Tue, 20 Oct 2015, Frederic Weisbecker wrote:

> There have been proposals to disable/tune the 1 Hz tick via debugfs which
> I Nacked because once you give such an opportunity to the users, they
> will use that hack and never fix the real underlying issue.

Well this is a pretty bad argument. By that reasoning no one should be
allowed to use root. After all stupid users will become root and kill
processes that are hung. And the underlying issue that causes those hangs
in processes will never be fixed because they will keep on killing
processes.



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot
  2015-10-21 14:28                   ` Christoph Lameter
@ 2015-10-21 15:35                     ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-21 15:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-doc, linux-kernel

On Wed, Oct 21, 2015 at 09:28:04AM -0500, Christoph Lameter wrote:
> On Tue, 20 Oct 2015, Frederic Weisbecker wrote:
> 
> > There have been proposals to disable/tune the 1 Hz tick via debugfs which
> > I Nacked because once you give such an opportunity to the users, they
> > will use that hack and never fix the real underlying issue.
> 
> Well this is a pretty bad argument. By that reasoning no one should be
> allowed to use root. After all stupid users will become root and kill
> processes that are hung. And the underlying issue that causes those hangs
> in processes will never be fixed because they will keep on killing
> processes.

I disagree. There is a significant frontier between:

_ hack it up and you're responsible for the consequences yourself

      and
      
_ provide a buggy hack to the user and support this officially upstream


Especially as all I've seen in two years, wrt. solving the 1 Hz issue, is patches like this.
Almost nobody has really tried to dig into the real issues, which are well known and identified
by now, and not that hard to fix: it's a set of standalone issues to make the scheduler
resilient to full-total-hard dynticks.

The only effort toward that I've seen lately is:  https://lkml.org/lkml/2015/10/14/192
and still I think the author came to that nohz issue by accident.

Many users are too easily satisfied with hacks like this, and this one is too good an
opportunity for them.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
@ 2015-10-21 16:12                   ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-21 16:12 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote:
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..9a73235db0bb
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,78 @@
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation for task isolation.
> + *
> + *  Distributed under GPLv2.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include "time/tick-sched.h"
> +
> +/*
> + * This routine controls whether we can enable task-isolation mode.
> + * The task must be affinitized to a single nohz_full core or we will
> + * return EINVAL.  Although the application could later re-affinitize
> + * to a housekeeping core and lose task isolation semantics, this
> + * initial test should catch 99% of bugs with task placement prior to
> + * enabling task isolation.
> + */
> +int task_isolation_set(unsigned int flags)
> +{
> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||

I think you'll have to make sure the task cannot be concurrently reaffined
to more CPUs. This may involve setting task_isolation_flags under the runqueue
lock and thus moving that tiny part to the scheduler code. And then we must either
forbid changing the affinity while the task has the isolation flag, or deactivate the flag.

In any case this needs some synchronization.
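
Something in this direction, maybe (only a rough sketch, and I'm assuming
pi_lock is enough to serialize against the sched_setaffinity() path, which
also takes it; untested):

	int task_isolation_set(unsigned int flags)
	{
		unsigned long irqflags;
		int ret = -EINVAL;

		/* pi_lock keeps the affinity check and the flag write
		 * atomic against a concurrent reaffining of the task. */
		raw_spin_lock_irqsave(&current->pi_lock, irqflags);
		if (cpumask_weight(tsk_cpus_allowed(current)) == 1 &&
		    tick_nohz_full_cpu(smp_processor_id())) {
			current->task_isolation_flags = flags;
			ret = 0;
		}
		raw_spin_unlock_irqrestore(&current->pi_lock, irqflags);
		return ret;
	}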

> +	    !tick_nohz_full_cpu(smp_processor_id()))
> +		return -EINVAL;
> +
> +	current->task_isolation_flags = flags;
> +	return 0;
> +}
> +
> +/*
> + * In task isolation mode we try to return to userspace only after
> + * attempting to make sure we won't be interrupted again.  To handle
> + * the periodic scheduler tick, we test to make sure that the tick is
> + * stopped, and if it isn't yet, we request a reschedule so that if
> + * another task needs to run to completion first, it can do so.
> + * Similarly, if any other subsystems require quiescing, we will need
> + * to do that before we return to userspace.
> + */
> +bool _task_isolation_ready(void)
> +{
> +	WARN_ON_ONCE(!irqs_disabled());
> +
> +	/* If we need to drain the LRU cache, we're not ready. */
> +	if (lru_add_drain_needed(smp_processor_id()))
> +		return false;
> +
> +	/* If vmstats need updating, we're not ready. */
> +	if (!vmstat_idle())
> +		return false;
> +
> +	/* If the tick is running, request rescheduling; we're not ready. */
> +	if (!tick_nohz_tick_stopped()) {

Note that this function tells whether the tick is in dynticks mode, which means
the tick currently only runs on-demand. But it's not necessarily completely stopped.

I think we should rename that function and the field it refers to.

> +		set_tsk_need_resched(current);
> +		return false;
> +	}
> +
> +	return true;
> +}

Thanks.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-21  6:41                         ` Gilad Ben Yossef
  (?)
@ 2015-10-21 18:53                         ` Andy Lutomirski
  2015-10-22 20:44                           ` Chris Metcalf
  2015-10-24  9:16                             ` Gilad Ben Yossef
  -1 siblings, 2 replies; 340+ messages in thread
From: Andy Lutomirski @ 2015-10-21 18:53 UTC (permalink / raw)
  To: Gilad Ben Yossef
  Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote:
>
>
>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>> Sent: Wednesday, October 21, 2015 4:43 AM
>> To: Chris Metcalf
>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
>> configurable signal
>>
>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com>
>> wrote:
>> > On 10/20/2015 8:56 PM, Steven Rostedt wrote:
>> >>
>> >> On Tue, 20 Oct 2015 16:36:04 -0400
>> >> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> >>
>> >>> Allow userspace to override the default SIGKILL delivered
>> >>> when a task_isolation process in STRICT mode does a syscall
>> >>> or otherwise synchronously enters the kernel.
>> >>>
> <snip>
>> >
>> > It doesn't map SIGKILL to some other signal unconditionally.  It just allows
>> > the "hey, you broke the STRICT contract and entered the kernel" signal
>> > to be something besides the default SIGKILL.
>> >
>>
>
> <snip>
>>
>> I still dislike this thing.  It seems like a debugging feature being
>> implemented using signals instead of existing APIs.  I *still* don't
>> see why perf can't be used to accomplish your goal.
>>
>
> It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late.
>
> Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter.
> The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an
> MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter
> auto stops.
>
> The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in
> such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned
> context switch to kernel), what you want to do is just stop in place
> rather than fire the alpha emitter X nanoseconds too late.
>
> This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once.

That's a fair point.  It's risky, though, for quite a few reasons.

1. If someone builds an alpha emitter like this, they did it wrong.
The kernel should write a trigger *and* a timestamp to the hardware
and the hardware should trigger at the specified time if the time is
in the future and throw an error if it's in the past.  If you need to
check that you made the deadline, check the actual desired condition
(did you meet the deadline?) not a proxy (did the signal fire?).  A
sketch of that kind of self-check follows below, after point 3.

2. This strict mode thing isn't exhaustive.  It's missing, at least,
coverage for NMI, MCE, and SMI.  Sure, you can think that you've
disabled all NMI sources, you can try to remember to set the
appropriate boot flag that panics on MCE (and hope that you don't get
screwed by broadcast MCE on Intel systems before it got fixed
(Skylake?  Is the fix even available in a released chip?)), and, for
SMI, good luck...

3. You haven't dealt with IPIs.  The TLB flush code in particular
seems like it will break all your assumptions.
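
(Back to point 1, a sketch of checking the real condition instead of the
proxy.  The MMIO helpers, the register names, and the 2us budget are
made-up placeholders, not anyone's real driver:)

	#include <stdint.h>
	#include <time.h>

	#define BUDGET_NS 2000ull			/* assumed budget: 2us */

	extern uint32_t mmio_read(int reg);		/* hypothetical */
	extern void mmio_write(int reg, uint32_t val);	/* hypothetical */
	extern uint32_t compute(uint32_t pos);		/* hypothetical */
	enum { POSITION_REG, TRIGGER_REG };		/* hypothetical */

	/* In a real strict-isolated loop you'd read the cycle counter
	 * directly, since clock_gettime() may itself enter the kernel. */
	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
	}

	void control_step(void)
	{
		uint64_t t0 = now_ns();
		uint32_t cmd = compute(mmio_read(POSITION_REG));

		if (now_ns() - t0 <= BUDGET_NS)
			mmio_write(TRIGGER_REG, cmd);
		/* else: budget blown; stop in place, don't fire late */
	}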

Maybe it would make sense to whack more of the moles before adding a
big assertion that there aren't any moles any more.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-21 12:39               ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Peter Zijlstra
@ 2015-10-22 20:31                   ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-22 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 10/21/2015 08:39 AM, Peter Zijlstra wrote:
> Can you *please* start a new thread with each posting?
>
> This is absolutely unmanageable.

I've been explicitly threading the multiple patch series on purpose
due to this text in "git help send-email":

        --in-reply-to=<identifier>
               Make the first mail (or all the mails with --no-thread) appear
               as a reply to the given Message-Id, which avoids breaking
               threads to provide a new patch series. The second and subsequent
               emails will be sent as replies according to the
               --[no]-chain-reply-to setting.

               So for example when --thread and --no-chain-reply-to are
               specified, the second and subsequent patches will be replies to
               the first one like in the illustration below where [PATCH v2
               0/3] is in reply to [PATCH 0/2]:

               [PATCH 0/2] Here is what I did...
                 [PATCH 1/2] Clean up and tests
                 [PATCH 2/2] Implementation
                 [PATCH v2 0/3] Here is a reroll
                   [PATCH v2 1/3] Clean up
                   [PATCH v2 2/3] New tests
                   [PATCH v2 3/3] Implementation

It sounds like this is exactly the behavior you are objecting
to.  It's all one to me because I am not seeing these emails
come up in some hugely nested fashion, but just viewing the
responses that I haven't yet triaged away.

So is your recommendation to avoid the git send-email --in-reply-to
option?  If so, would you recommend including an lkml.kernel.org
link in the cover letter pointing to the previous version, or
is there something else that would make your workflow better?

If you think this is actually the wrong thing, is it worth trying
to fix the git docs to deprecate this option?  Or is it more a question
of scale, and the 80-odd patches that I've posted so far just pushed
an otherwise good system into a more dysfunctional mode?  If so,
perhaps some text in Documentation/SubmittingPatches would be
helpful here.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-21 18:53                         ` Andy Lutomirski
@ 2015-10-22 20:44                           ` Chris Metcalf
  2015-10-22 21:00                             ` Andy Lutomirski
  2015-10-24  9:16                             ` Gilad Ben Yossef
  1 sibling, 1 reply; 340+ messages in thread
From: Chris Metcalf @ 2015-10-22 20:44 UTC (permalink / raw)
  To: Andy Lutomirski, Gilad Ben Yossef
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel

On 10/21/2015 02:53 PM, Andy Lutomirski wrote:
> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote:
>>
>>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>>> Sent: Wednesday, October 21, 2015 4:43 AM
>>> To: Chris Metcalf
>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
>>> configurable signal
>>>
>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com>
>>> wrote:
>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote:
>>>>> On Tue, 20 Oct 2015 16:36:04 -0400
>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>>>>
>>>>>> Allow userspace to override the default SIGKILL delivered
>>>>>> when a task_isolation process in STRICT mode does a syscall
>>>>>> or otherwise synchronously enters the kernel.
>>>>>>
>> <snip>
>>>> It doesn't map SIGKILL to some other signal unconditionally.  It just allows
>>>> the "hey, you broke the STRICT contract and entered the kernel" signal
>>>> to be something besides the default SIGKILL.
>>>>
>> <snip>
>>> I still dislike this thing.  It seems like a debugging feature being
>>> implemented using signals instead of existing APIs.  I *still* don't
>>> see why perf can't be used to accomplish your goal.
>>>
>> It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late.
>>
>> Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter.
>> The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an
>> MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter
>> auto stops.
>>
>> The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in
>> such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned
>> context switch to kernel), what you want to do is just stop in place
>> rather than fire the alpha emitter X nanoseconds too late.
>>
>> This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once.
> That's a fair point.  It's risky, though, for quite a few reasons.
>
> 1. If someone builds an alpha emitter like this, they did it wrong.
> The kernel should write a trigger *and* a timestamp to the hardware
> and the hardware should trigger at the specified time if the time is
> in the future and throw an error if it's in the past.  If you need to
> check that you made the deadline, check the actual desired condition
> (did you meet the deadline?) not a proxy (did the signal fire?).

Definitely a better hardware design, but as we all know, hardware
designers too rarely consult the software people who have to
write the actual code to properly drive the hardware :-)

My canonical example is high-performance userspace network
drivers, and though dropping a packet is less likely to kill a
patient, it's still a pretty bad thing if you're trying to design
a robust appliance.  In this case you really want to fix application
bugs that cause the code to enter the kernel when you think
you're in the internal loop running purely in userspace.  Things
like unexpected page faults, or third-party code that almost
never calls the kernel but occasionally does in some dusty corner,
can screw up your userspace code pretty badly, and
mysteriously.  The "strict" mode support is not a hypothetical
insurance policy but a reaction to lots of Tilera customer support
over the years for folks failing to stay in userspace when they
thought they were doing the right thing.

> 2. This strict mode thing isn't exhaustive.  It's missing, at least,
> coverage for NMI, MCE, and SMI.  Sure, you can think that you've
> disabled all NMI sources, you can try to remember to set the
> appropriate boot flag that panics on MCE (and hope that you don't get
> screwed by broadcast MCE on Intel systems before it got fixed
> (Skylake?  Is the fix even available in a released chip?)), and, for
> SMI, good luck...

You are confusing this strict mode support with the debug
support in patch 07/14.

Strict mode is for synchronous application errors.  You might
be right that there are cases that haven't been covered, but
certainly most of them are covered on the three platforms that
are supported in this initial series.  (You pointed me to one
that I would have missed on x86, namely the bounds check
exception from a bad bounds setup.)  I'm pretty confident I
have all of them for tile, since I know that hardware best,
and I think we're in good shape for arm64, though I'm still
coming up to speed on that architecture.

NMIs and machine checks are asynchronous interrupts that
don't have to do with what the application is doing, more or less.
Those should not be delivered to task-isolation cores at all,
so we just generate console spew when you set the
task_isolation_debug boot option.  I honestly don't know enough
about system management interrupts to comment on that,
though again, I would hope one can configure the system to
just not deliver them to nohz_full cores, and I think it would
be reasonable to generate some kernel spew if that happens.
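
For example, I'd expect such a machine to boot with something like this
(the CPU list is purely illustrative):

	nohz_full=2-7 isolcpus=2-7 task_isolation_debug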

> 3. You haven't dealt with IPIs.  The TLB flush code in particular
> seems like it will break all your assumptions.

Again, not a synchronous application error that we are trying
to catch with this signalling mechanism.

That said it could obviously be a more general application error
(e.g. a process with threads on both nohz_full and housekeeping
cores, where the housekeeping core unmaps some memory and
thus requires a TLB flush IPI).  But this is covered by the
task_isolation_debug patch for kernel/smp.c.

> Maybe it would make sense to whack more of the moles before adding a
> big assertion that there aren't any moles any more.

Maybe, but I've whacked the ones I know how to whack.
If there are ones I've missed I'm happy to add them in a
subsequent version of this series, or in follow-on patches.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
  2015-10-22 20:44                           ` Chris Metcalf
@ 2015-10-22 21:00                             ` Andy Lutomirski
  2015-10-27 19:37                                 ` Chris Metcalf
  0 siblings, 1 reply; 340+ messages in thread
From: Andy Lutomirski @ 2015-10-22 21:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 10/21/2015 02:53 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com>
>> wrote:
>>>
>>>
>>>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>>>> Sent: Wednesday, October 21, 2015 4:43 AM
>>>> To: Chris Metcalf
>>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
>>>> configurable signal
>>>>
>>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com>
>>>> wrote:
>>>>>
>>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote:
>>>>>>
>>>>>> On Tue, 20 Oct 2015 16:36:04 -0400
>>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>>>>>
>>>>>>> Allow userspace to override the default SIGKILL delivered
>>>>>>> when a task_isolation process in STRICT mode does a syscall
>>>>>>> or otherwise synchronously enters the kernel.
>>>>>>>
>>> <snip>
>>>>>
>>>>> It doesn't map SIGKILL to some other signal unconditionally.  It just
>>>>> allows
>>>>> the "hey, you broke the STRICT contract and entered the kernel" signal
>>>>> to be something besides the default SIGKILL.
>>>>>
>>> <snip>
>>>>
>>>> I still dislike this thing.  It seems like a debugging feature being
>>>> implemented using signals instead of existing APIs.  I *still* don't
>>>> see why perf can't be used to accomplish your goal.
>>>>
>>> It is not (just) a debugging feature. There are workloads where not
>>> performing an action is much preferred to being late.
>>>
>>> Consider the following artificial but representative scenario: a task
>>> running in strict isolation is controlling a radiotherapy alpha emitter.
>>> The code runs in a tight event loop, reading an MMIO register with
>>> location data, making some calculation and in response writing an
>>> MMIO register that triggers the alpha emitter. As a safety measure, each
>>> trigger is for a specific very short time frame - the alpha emitter
>>> auto stops.
>>>
>>> The code has a strict assumption that no more than X cycles pass between
>>> reading the value and the response and the system is built in
>>> such a way that as long as the code has mastery of the CPU the assumption
>>> holds true. If something breaks this assumption (unplanned
>>> context switch to kernel), what you want to do is just stop in place
>>> rather than fire the alpha emitter X nanoseconds too late.
>>>
>>> This feature lets you say: if the "contract" of isolation is broken,
>>> notify/kill me at once.
>>
>> That's a fair point.  It's risky, though, for quite a few reasons.
>>
>> 1. If someone builds an alpha emitter like this, they did it wrong.
>> The kernel should write a trigger *and* a timestamp to the hardware
>> and the hardware should trigger at the specified time if the time is
>> in the future and throw an error if it's in the past.  If you need to
>> check that you made the deadline, check the actual desired condition
>> (did you meet the deadline?) not a proxy (did the signal fire?).
>
>
> Definitely a better hardware design, but as we all know, hardware
> designers too rarely consult the software people who have to
> write the actual code to properly drive the hardware :-)
>
> My canonical example is high-performance userspace network
> drivers, and though dropping a packet is less likely to kill a
> patient, it's still a pretty bad thing if you're trying to design
> a robust appliance.  In this case you really want to fix application
> bugs that cause the code to enter the kernel when you think
> you're in the internal loop running purely in userspace.  Things
> like unexpected page faults, or third-party code that almost
> never calls the kernel but occasionally does in some dusty corner,
> can screw up your userspace code pretty badly, and
> mysteriously.  The "strict" mode support is not a hypothetical
> insurance policy but a reaction to lots of Tilera customer support
> over the years for folks failing to stay in userspace when they
> thought they were doing the right thing.

But this is *exactly* the case where perf or other out-of-band
debugging could be a much better solution.  Perf could notify a
non-isolated thread that an interrupt happened, you'd still drop a
packet or two, but you wouldn't also drop the next ten thousand
packets while handling the signal.
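
Concretely, the out-of-band version could be as small as this sketch
(no error handling, and the once-a-second poll is arbitrary): a monitor
thread counts context switches on the isolated thread and squawks when
one happens.

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <string.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	void monitor_isolated_tid(pid_t tid)
	{
		struct perf_event_attr attr;
		uint64_t count = 0;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
		/* Count context switches of the isolated thread. */
		fd = syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);

		for (;;) {
			read(fd, &count, sizeof(count));
			if (count) {
				fprintf(stderr,
					"isolation broken: %llu context switches\n",
					(unsigned long long)count);
				break;
			}
			sleep(1);
		}
	}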

>
>> 2. This strict mode thing isn't exhaustive.  It's missing, at least,
>> coverage for NMI, MCE, and SMI.  Sure, you can think that you've
>> disabled all NMI sources, you can try to remember to set the
>> appropriate boot flag that panics on MCE (and hope that you don't get
>> screwed by broadcast MCE on Intel systems before it got fixed
>> (Skylake?  Is the fix even available in a released chip?)), and, for
>> SMI, good luck...
>
>
> You are confusing this strict mode support with the debug
> support in patch 07/14.

Nope.  I'm confusing this strict mode with what Gilad described: using
strict mode to cause outright shutdown instead of failure to meet a
deadline.

(FWIW, you could also use an ordinary hardware watchdog timer to
promote your failure to meet a deadline to a shutdown.  No new kernel
support needed.)
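
(A sketch of the watchdog idea: pet /dev/watchdog only while deadlines
are being met, stop petting on a miss, and the hardware resets the box.
deadline_was_met() is a stand-in for whatever app-specific check
applies, and the 10s timeout is arbitrary:)

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/watchdog.h>

	extern int deadline_was_met(void);	/* app-specific stand-in */

	void run_with_watchdog(void)
	{
		int fd = open("/dev/watchdog", O_WRONLY);
		int timeout = 10;		/* seconds */

		ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
		while (deadline_was_met())
			ioctl(fd, WDIOC_KEEPALIVE, 0);
		pause();	/* miss: stop petting, watchdog fires */
	}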

>
> Strict mode is for synchronous application errors.  You might
> be right that there are cases that haven't been covered, but
> certainly most of them are covered on the three platforms that
> are supported in this initial series.  (You pointed me to one
> that I would have missed on x86, namely the bounds check
> exception from a bad bounds setup.)  I'm pretty confident I
> have all of them for tile, since I know that hardware best,
> and I think we're in good shape for arm64, though I'm still
> coming up to speed on that architecture.

Again, for this definition of strict mode, I still don't see why it's
the right design.  If you want to debug your application to detect
application errors, use a debugging interface.

>
> NMIs and machine checks are asynchronous interrupts that
> don't have to do with what the application is doing, more or less.
> Those should not be delivered to task-isolation cores at all,
> so we just generate console spew when you set the
> task_isolation_debug boot option.  I honestly don't know enough
> about system management interrupts to comment on that,
> though again, I would hope one can configure the system to
> just not deliver them to nohz_full cores, and I think it would
> be reasonable to generate some kernel spew if that happens.

Hah hah yeah right.  On most existing Intel CPUs, you *cannot*
configure machine checks to do anything other than broadcast to all
cores or cause immediate shutdown.  And getting any sort of reasonable
control over SMI more or less requires special firmware.

>
>> 3. You haven't dealt with IPIs.  The TLB flush code in particular
>> seems like it will break all your assumptions.
>
>
> Again, not a synchronous application error that we are trying
> to catch with this signalling mechanism.
>
> That said it could obviously be a more general application error
> (e.g. a process with threads on both nohz_full and housekeeping
> cores, where the housekeeping core unmaps some memory and
> thus requires a TLB flush IPI).  But this is covered by the
> task_isolation_debug patch for kernel/smp.c.
>
>> Maybe it would make sense to whack more of the moles before adding a
>> big assertion that there aren't any moles any more.
>
>
> Maybe, but I've whacked the ones I know how to whack.
> If there are ones I've missed I'm happy to add them in a
> subsequent version of this series, or in follow-on patches.
>

I agree that you can, in principle, catch all the synchronous
application errors using this mechanism.  I'm saying that catching
them seems quite useful, but catching them using a prctl that causes a
signal and explicitly does *not* solve the deadline enforcement
problem seems to have dubious value in the upstream kernel.

You can't catch the asynchronous application errors with this
mechanism (or at least your ability to catch them depends on which
patch version IIRC), which include calling anything like munmap or
membarrier in another thread.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-22 20:31                   ` Chris Metcalf
  (?)
@ 2015-10-23  2:33                   ` Frederic Weisbecker
  2015-10-23  8:49                     ` Peter Zijlstra
  -1 siblings, 1 reply; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-23  2:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> On 10/21/2015 08:39 AM, Peter Zijlstra wrote:
> >Can you *please* start a new thread with each posting?
> >
> >This is absolutely unmanageable.
> 
> I've been explicitly threading the multiple patch series on purpose
> due to this text in "git help send-email":
> 
>        --in-reply-to=<identifier>
>               Make the first mail (or all the mails with --no-thread) appear
>               as a reply to the given Message-Id, which avoids breaking
>               threads to provide a new patch series. The second and subsequent
>               emails will be sent as replies according to the
>               --[no]-chain-reply-to setting.
> 
>               So for example when --thread and --no-chain-reply-to are
>               specified, the second and subsequent patches will be replies to
>               the first one like in the illustration below where [PATCH v2
>               0/3] is in reply to [PATCH 0/2]:
> 
>               [PATCH 0/2] Here is what I did...
>                 [PATCH 1/2] Clean up and tests
>                 [PATCH 2/2] Implementation
>                 [PATCH v2 0/3] Here is a reroll
>                   [PATCH v2 1/3] Clean up
>                   [PATCH v2 2/3] New tests
>                   [PATCH v2 3/3] Implementation
> 
> It sounds like this is exactly the behavior you are objecting
> to.  It's all one to me because I am not seeing these emails
> come up in some hugely nested fashion, but just viewing the
> responses that I haven't yet triaged away.

I personally (and I think this is the general LKML behaviour) use in-reply-to
when I post a single patch that is a fix for a bug, or a small enhancement,
discussed on some thread. It works well as it fits the conversation inline.

But for anything that requires significant changes, namely a patchset,
and that includes a new version of such patchset, it's usually better
to create a new thread. Otherwise the thread becomes an infinite mess and it
eventually expands beyond the mail client's columns.
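
Concretely that just means dropping --in-reply-to from the first mail of
the new version, ie. something like (patch names illustrative):

	git send-email --thread --no-chain-reply-to \
	    --to=linux-kernel@vger.kernel.org v9-*.patch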

Thanks.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-23  2:33                   ` Frederic Weisbecker
@ 2015-10-23  8:49                     ` Peter Zijlstra
  2015-10-23 13:29                       ` Frederic Weisbecker
  0 siblings, 1 reply; 340+ messages in thread
From: Peter Zijlstra @ 2015-10-23  8:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote:
> On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> > On 10/21/2015 08:39 AM, Peter Zijlstra wrote:
> > >Can you *please* start a new thread with each posting?
> > >
> > >This is absolutely unmanageable.
> > 
> > I've been explicitly threading the multiple patch series on purpose
> > due to this text in "git help send-email":
> > 
> >        --in-reply-to=<identifier>
> >               Make the first mail (or all the mails with --no-thread) appear
> >               as a reply to the given Message-Id, which avoids breaking
> >               threads to provide a new patch series. The second and subsequent
> >               emails will be sent as replies according to the
> >               --[no]-chain-reply-to setting.
> > 
> >               So for example when --thread and --no-chain-reply-to are
> >               specified, the second and subsequent patches will be replies to
> >               the first one like in the illustration below where [PATCH v2
> >               0/3] is in reply to [PATCH 0/2]:
> > 
> >               [PATCH 0/2] Here is what I did...
> >                 [PATCH 1/2] Clean up and tests
> >                 [PATCH 2/2] Implementation
> >                 [PATCH v2 0/3] Here is a reroll
> >                   [PATCH v2 1/3] Clean up
> >                   [PATCH v2 2/3] New tests
> >                   [PATCH v2 3/3] Implementation
> > 
> > It sounds like this is exactly the behavior you are objecting
> > to.  It's all one to me because I am not seeing these emails
> > come up in some hugely nested fashion, but just viewing the
> > responses that I haven't yet triaged away.

Yeah, the git people are not by definition following lkml standards,
even though git originated 'here'. They, for a long time, also defaulted
to --chain-reply-to, which is absolutely insane.

> I personally (and I think this is the general LKML behaviour) use in-reply-to
> when I post a single patch that is a fix for a bug, or a small enhancement,
> discussed on some thread. It works well as it fits the conversation inline.
> 
> But for anything that requires significant changes, namely a patchset,
> and that includes a new version of such patchset, it's usually better
> to create a new thread. Otherwise the thread becomes an infinite mess and it
> eventually expands beyond the mail client's columns.

Agreed, although for single patches I use my regular mailer (mutt) and
can't be arsed with tools. Also I don't actually use git-send-email
ever, so I might be biased.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
@ 2015-10-23  9:04                     ` Peter Zijlstra
  0 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2015-10-23  9:04 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:

> So is your recommendation to avoid the git send-email --in-reply-to
> option?  If so, would you recommend including an lkml.kernel.org
> link in the cover letter pointing to the previous version, or
> is there something else that would make your workflow better?

Mostly people don't bother with pointing to previous versions, and if
they have the same 0/x subject, they're typically trivial to find
anyway.

But if you really feel the need for explicit references to previous
versions, then yes, lkml.kernel.org/r/ links are preferred over pretty
much anything else I think.

> If you think this is actually the wrong thing, is it worth trying
> to fix the git docs to deprecate this option?

As said in the other email: git has different standards than lkml. By
now we're just one of many many users of git.

> Or is it more a question
> of scale, and the 80-odd patches that I've posted so far just pushed
> an otherwise good system into a more dysfunctional mode?  If so,
> perhaps some text in Documentation/SubmittingPatches would be
> helpful here.

Documentation/email-clients.txt maybe.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-23  9:04                     ` Peter Zijlstra
  (?)
@ 2015-10-23 11:52                     ` Theodore Ts'o
  -1 siblings, 0 replies; 340+ messages in thread
From: Theodore Ts'o @ 2015-10-23 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel

On Fri, Oct 23, 2015 at 11:04:59AM +0200, Peter Zijlstra wrote:
> > If you think this is actually the wrong thing, is it worth trying
> > to fix the git docs to deprecate this option?
> 
> As said in the other email; git has different standards than lkml. By
> now we're just one of many many users of git.

Even git developers will create a new thread for a large (more than
2-3 patches) patch set.  However, for a single patch, people have
chained the -v3 version of the draft --- not to the v2 version,
though, but to the review of the patch.  And I've seen that behavior
on some LKML lists, and I'm certainly fine with it on linux-ext4.

But if you have a huge patch series, and you keep chaining it to the
8th, 10th, 22nd version, it certainly will get **very** annoying for
some MUAs.

The bottom line is that you should use common sense, and it can be
hard to document every last bit of what should be "common sense" into
a rule that is followed by robots or a perl script.  (Which is one of
the reasons why I'm not fond of the philosophy that every single last
checkpatch warning or error should result in a "cleanup" patch, but
that's another issue.)

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-23  8:49                     ` Peter Zijlstra
@ 2015-10-23 13:29                       ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2015-10-23 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Oct 23, 2015 at 10:49:51AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote:
> > I personally (and I think this is the general LKML behaviour) use in-reply-to
> > when I post a single patch that is a fix for a bug, or a small enhancement,
> > discussed on some thread. It works well as it fits the conversation inline.
> > 
> > But for anything that requires significant changes, namely a patchset,
> > and that includes a new version of such patchset, it's usually better
> > to create a new thread. Otherwise the thread becomes an infinite mess and it
> > eventually expands beyond the mail client's columns.
> 
> Agreed, although for single patches I use my regular mailer (mutt) and
> can't be arsed with tools.

Yeah me too, otherwise I can't write text before the patch changelog.

> Also I don't actually use git-send-email ever, so I might be biased.

Ah it's just too convenient so I wrote my scripts on top of it :-)
But surely many mail sender libraries can post patches just fine as well.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
@ 2015-10-24  9:16                             ` Gilad Ben Yossef
  0 siblings, 0 replies; 340+ messages in thread
From: Gilad Ben Yossef @ 2015-10-24  9:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel


Hi Andy,

Thanks for the feedback.


> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Wednesday, October 21, 2015 9:53 PM
> To: Gilad Ben Yossef
> Cc: Chris Metcalf; Steven Rostedt; Ingo Molnar; Peter Zijlstra; Andrew
> Morton; Rik van Riel; Tejun Heo; Frederic Weisbecker; Thomas Gleixner; Paul
> E. McKenney; Christoph Lameter; Viresh Kumar; Catalin Marinas; Will Deacon;
> linux-doc@vger.kernel.org; Linux API; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
> configurable signal
> 


> >> >> On Tue, 20 Oct 2015 16:36:04 -0400
> >> >> Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >> >>
> >> >>> Allow userspace to override the default SIGKILL delivered
> >> >>> when a task_isolation process in STRICT mode does a syscall
> >> >>> or otherwise synchronously enters the kernel.
> >> >>>
> > <snip>
> >> >
> >> > It doesn't map SIGKILL to some other signal unconditionally.  It just allows
> >> > the "hey, you broke the STRICT contract and entered the kernel" signal
> >> > to be something besides the default SIGKILL.
> >> >
> >>
> >
> > <snip>
> >>
> >> I still dislike this thing.  It seems like a debugging feature being
> >> implemented using signals instead of existing APIs.  I *still* don't
> >> see why perf can't be used to accomplish your goal.
> >>
> >
> > It is not (just) a debugging feature. There are workloads where not
> performing an action is much preferred to being late.
> >
> > Consider the following artificial but representative scenario: a task running
> in strict isolation is controlling a radiotherapy alpha emitter.
> > The code runs in a tight event loop, reading an MMIO register with location
> data, making some calculation and in response writing an
> > MMIO register that triggers the alpha emitter. As a safety measure, each
> trigger is for a specific very short time frame - the alpha emitter
> > auto stops.
> >
> > The code has a strict assumption that no more than X cycles pass between
> reading the value and the response and the system is built in
> > such a way that as long as the code has mastery of the CPU the assumption
> holds true. If something breaks this assumption (unplanned
> > context switch to kernel), what you want to do is just stop in place
> > rather than fire the alpha emitter X nanoseconds too late.
> >
> > This feature lets you say: if the "contract" of isolation is broken, notify/kill
> me at once.
> 
> That's a fair point.  It's risky, though, for quite a few reasons.
> 
> 1. If someone builds an alpha emitter like this, they did it wrong.
> The kernel should write a trigger *and* a timestamp to the hardware
> and the hardware should trigger at the specified time if the time is
> in the future and throw an error if it's in the past.  If you need to
> check that you made the deadline, check the actual desired condition
> (did you meet the deadline?) not a proxy (did the signal fire?).
> 

As I wrote above, it is an *artificial* scenario.

Yes, hardware and systems can be designed better, but they often are
not, and in these kinds of systems you really do want to have
double or triple checks.

Knowing such systems, even IF the hardware was designed as you
specified (and I agree it should be!), you would still add the software
protection.

> 2. This strict mode thing isn't exhaustive.  It's missing, at least,
> coverage for NMI, MCE, and SMI.  Sure, you can think that you've
> disabled all NMI sources, you can try to remember to set the
> appropriate boot flag that panics on MCE (and hope that you don't get
> screwed by broadcast MCE on Intel systems before it got fixed
> (Skylake?  Is the fix even available in a released chip?)), and, for
> SMI, good luck...

You are right - it isn't exhaustive. It is one piece in a bigger puzzle.
Many of the other bits are platform specific and some of them have
been dealt with on the platform that care about these things.

Yes, we don't have dark magic to detect SMIs. Is that a reason to penalize
platforms where there is no such thing as SMI? 
 
 
> 3. You haven't dealt with IPIs.  The TLB flush code in particular
> seems like it will break all your assumptions.
>

But we have - in the general context. Consider this patch set from 2012 -
https://lwn.net/Articles/479510/

Not finished, for sure. But what we have is now useful enough that it is used
in the real world for different workloads on different platforms, from packet
processing through HPC to high-frequency trading.

> Maybe it would make sense to whack more of the moles before adding a
> big assertion that there aren't any moles any more.
> 

hm... maybe you are reading too much into this specific feature - it's a
"notify me, the application, if I asked you to do something that violates 
my previous request to be isolated", rather than "notify me whenever isolation is broken".
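
In application terms the intended use is roughly the following sketch.
Note that the prctl flag and signal-selection macro names and values
below are taken from this patch series, not from mainline, so treat
them as assumptions:

	#include <signal.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	/* Assumed names/values from this series; not mainline API. */
	#ifndef PR_SET_TASK_ISOLATION
	#define PR_SET_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#define PR_TASK_ISOLATION_STRICT	(1 << 1)
	#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
	#endif

	static void isolation_broken(int sig)
	{
		/* We asked for isolation and entered the kernel anyway:
		 * stop in place rather than act late. */
		write(2, "isolation contract broken\n", 26);
		_exit(1);
	}

	int request_strict_isolation(void)
	{
		signal(SIGUSR1, isolation_broken);
		return prctl(PR_SET_TASK_ISOLATION,
			     PR_TASK_ISOLATION_ENABLE |
			     PR_TASK_ISOLATION_STRICT |
			     PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
	}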

Does that make more sense?

Thanks,
Gilad

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2015-10-21  0:29                         ` Steven Rostedt
  (?)
@ 2015-10-26 20:19                         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-26 20:19 UTC (permalink / raw)
  To: Steven Rostedt, Andy Lutomirski
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel

Andy wrote:
> Your patches more or less implement "don't run me unless I'm
> isolated".  A scheduler class would be more like "isolate me (and
> maybe make me super high priority so it actually happens)".

Steven wrote:
> Since it only makes sense to run one isolated task per cpu (not more
> than one on the same CPU), I wonder if we should add a new interface
> for this, that would force everything else off the CPU that it
> requests. That is, you bind a task to a CPU, and then change it to
> SCHED_ISOLATED (or what not), and the kernel will force all other tasks
> off that CPU.

Frederic wrote:
> I think you'll have to make sure the task can not be concurrently
> reaffined to more CPUs. This may involve setting task_isolation_flags
> under the runqueue lock and thus move that tiny part to the scheduler
> code. And then we must forbid changing the affinity while the task has
> the isolation flag, or deactivate the flag.

These comments are all about the same high-level question, so I
want to address it in this reply.

The question is, should TASK_ISOLATION be "polite" or "aggressive"?
The original design was "polite": it worked as long as no other thing
on the system tried to mess with it.  The suggestions above are for an
"aggressive" design.

The "polite" design basically tags a task as being interested in
having the kernel help it out by staying away from it.  It relies on
running on a nohz_full cpu to keep scheduler ticks away from it.  It
relies on running on an isolcpus cpu to keep other processes from
getting dynamically load-balanced onto it and messing it up.  And, of
course, it relies on the other applications and users running on the
machine not to affinitize themselves onto its core and mess it up that
way.  But, as long as all those things are true, the kernel will try
to help it out by never interrupting it.  (And, it allows for the
kernel to report when those expectations are violated.)

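For concreteness, a minimal sketch of the "polite" flow from userspace
might look like the following.  The prctl constants are the ones this
series proposes (their names and values are assumptions, not mainline),
and cpu 3 is assumed to have been booted with nohz_full=3 isolcpus=3:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Assumed from this patch series; not in mainline headers. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	1
#endif

int main(void)
{
	cpu_set_t set;

	/* Step 1: affinitize to a single nohz_full/isolcpus core. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Step 2: ask the kernel to keep interrupts away from here on. */
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		  0, 0, 0) != 0) {
		perror("prctl");
		return 1;
	}

	/* Step 3: the userspace-only fast path (e.g. poll a NIC ring). */
	for (;;)
		;
}
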
The "aggressive" design would have an API that said "This is my core!".
The kernel would enforce keeping other processes off the core.  It
would require nohz_full semantics on that core.  It would lock the
task to that core in some way that would override attempts to reset
its sched_affinity.  It would do whatever else was necessary to make
that core unavailable to the rest of the system.

Advantages of the "polite" design:

- No special privileges required
- As a result, no security issues to sort through (capabilities, etc.)
- Therefore easy to use when running as an unprivileged user
- Won't screw up the occasional kernel task that needs to run

Advantages of the "aggressive" design:

- Clearer that the application will get the task isolation it wants
- Makes it more reasonable to enforce kernel performance tweaks
   on the local core (e.g. flushing the per-cpu LRU cache)

The "aggressive" design is certainly tempting, but there may be other
negative consequences of this design: for example, if we need to run a
usermode helper process as a result of some system call, we do want to
ensure that it can run, and we need to allow it to be scheduled, even
if it's just a regular scheduler class thing.  The "polite" design
allows the usermode helper to run and just waits until it's safe for
the isolated task to return to userspace.  Possibly we could arrange
for a SCHED_ISOLATED class to allow that kind of behavior, though I'm
not familiar enough with the scheduler code to say for sure.

I think it's important that we're explicit about which of these two
approaches feels like the more appropriate one.  Possibly my Tilera
background is part of what pushes me towards the "polite" design; we
have a lot of cores, so they're a kind of trivial resource that we
don't need to aggressively defend, and it's a more conservative design
to enable task isolation only when all the relevant criteria have been
met, rather than enforcing those criteria up front.

I think if we adopt the "aggressive" model, it might likely make sense
to express it as a scheduling policy, since it would include core
scheduler changes such as denying other tasks the right to call
sched_setaffinity() with an affinity that includes cores currently in
use by SCHED_ISOLATED tasks.  This would be something pretty deeply
hooked into the scheduler and therefore might require some more
substantial changes.  In addition, of course, there's the cost of
documenting yet another scheduler policy.

In the "polite" model, we certainly could use a SCHED_ISOLATED
scheduling policy (with static priority zero) to indicate
task-isolation mode, rather than using prctl() to set a task_struct
bit.  I'm not sure how much it gains, though.  It could allow the
scheduler to detect that the only "runnable" task actually didn't want
to be run, and switch briefly to the idle task, but since this would
likely only be for a scheduler tick or two, the power advantages are
pretty minimal, for a non-trivial additional piece of complexity
both in the API (documenting a new scheduler class) and in the
implementation (putting new requirements into the scheduler
implementations).  So I'm somewhat dubious, although willing to be
pushed in that direction if that's the consensus.

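If we did go the scheduler-class route, I'd expect enabling isolation
to look like an ordinary policy switch, roughly as below (SCHED_ISOLATED
and its value are purely hypothetical - no such policy exists today):

#include <sched.h>

/* Hypothetical policy number; SCHED_ISOLATED is only a proposal. */
#define SCHED_ISOLATED	7

static int enable_isolation(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	/* 0 == the calling task; fails with EINVAL on today's kernels. */
	return sched_setscheduler(0, SCHED_ISOLATED, &sp);
}
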
On balance I think it still feels to me like the original proposed
direction (a "polite" task isolation mode with a prctl bit) feels
better than the scheduler-based alternatives that have been proposed.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2015-10-20 21:26                       ` Andy Lutomirski
  (?)
  (?)
@ 2015-10-26 20:32                       ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-26 20:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 10/20/2015 05:26 PM, Andy Lutomirski wrote:
>>   Even more importantly, we rely on
>> rescheduling to take care of the fact that the scheduler tick may still
>> be running, and therefore loop back to the schedule() call that's run
>> when TIF_NEED_RESCHED gets set.
> This just seems like a mis-design.  We don't know why the scheduler
> tick is on, so we're just going to reschedule until the problem goes
> away?

See my previous email about polite vs aggressive design for more
thoughts on this, but, yes.  I'm not sure there's a way to do anything
else, other than my proposal there to dig deep into the scheduler
and allow it to switch to idle for a few ticks - but again, I'm just
not sure the complexity is worth the runtime power savings.

>>> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)?
>>
>> So a scheduler class is an interesting idea certainly, although not
>> one I know immediately how to implement.  I'm not sure whether
>> it makes sense to require a user be root or have a suitable rtprio
>> rlimit, but perhaps so.  The nice thing about the current patch
>> series is that you can affinitize yourself to a nohz_full core and
>> declare that you want to run task-isolated, and none of that
>> requires root nor really is there a reason it should.
> Your patches more or less implement "don't run me unless I'm
> isolated".  A scheduler class would be more like "isolate me (and
> maybe make me super high priority so it actually happens)".
>
> I'm not a scheduler person, so I don't know.  But "don't run me unless
> I'm isolated" seems like a design that will, at best, only ever work
> by dumb luck.  You have to disable migration, avoid other runnable
> tasks, hope that the kernel keeps working the way it did when you
> wrote the patch, hope you continue to get lucky enough that you ever
> get to user mode in the first place, etc.

Could you explain the "dumb luck" characterization a bit more?

You're definitely right that I need to test for isolcpus separately
now that it's been decoupled from nohz_full again, so I will
add that to the next release of the series.

But the rest of it seems like things you just control for when you
are running the application, and if you do it right, the
application runs.  If you don't (e.g. you intentionally schedule
multiple processes on the same core), the app doesn't run,
and you fix it in development.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2015-10-21  0:29                         ` Steven Rostedt
  (?)
  (?)
@ 2015-10-26 21:13                         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-26 21:13 UTC (permalink / raw)
  To: Steven Rostedt, Andy Lutomirski
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel

On 10/20/2015 08:29 PM, Steven Rostedt wrote:
> Also, doesn't RCU need to have a few ticks go by before it can safely
> disable itself from userspace? I recall something like that. Paul?

The current patch series supports that by testing tick_nohz_tick_stopped(),
which internally only becomes true after tick_nohz_stop_sched_tick()
manages to stop the tick, and it won't if rcu_needs_cpu() is true.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2015-10-21 16:12                   ` Frederic Weisbecker
@ 2015-10-27 16:40                     ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-27 16:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 10/21/2015 12:12 PM, Frederic Weisbecker wrote:
> On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote:
>> +/*
>> + * This routine controls whether we can enable task-isolation mode.
>> + * The task must be affinitized to a single nohz_full core or we will
>> + * return EINVAL.  Although the application could later re-affinitize
>> + * to a housekeeping core and lose task isolation semantics, this
>> + * initial test should catch 99% of bugs with task placement prior to
>> + * enabling task isolation.
>> + */
>> +int task_isolation_set(unsigned int flags)
>> +{
>> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> I think you'll have to make sure the task can not be concurrently reaffined
> to more CPUs. This may involve setting task_isolation_flags under the runqueue
> lock and thus move that tiny part to the scheduler code. And then we must forbid
> changing the affinity while the task has the isolation flag, or deactivate the flag.
>
> In any case this needs some synchronization.

Well, as the comment says, this is not intended as a hard guarantee.
As written, it might race with a concurrent sched_setaffinity(), but
then again, it also is totally OK as written for sched_setaffinity() to
change it away after the prctl() is complete, so it's not necessary to
do any explicit synchronization.

This harks back again to the whole "polite vs aggressive" issue with
how we envision task isolation.

The "polite" model basically allows you to set up the conditions for
task isolation to be useful, and then if they are useful, great! What
you're suggesting here is a bit more of the "aggressive" model, where
we actually fail sched_setaffinity() either for any cpumask after
task isolation is set, or perhaps just for resetting it to housekeeping
cores.  (Note that we could in principle use PF_NO_SETAFFINITY to
just hard fail all attempts to call sched_setaffinity once we enable
task isolation, so we don't have to add more mechanism on that path.)

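(For reference, the existing gate in mainline's sched_setaffinity() is
roughly the following, so task isolation would only need to set the
flag:)

	/* kernel/sched/core.c, approximately as in mainline today: */
	if (p->flags & PF_NO_SETAFFINITY) {
		retval = -EINVAL;
		goto out_put_task;
	}
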
I'm a little reluctant to ever fail sched_setaffinity() based on the
task isolation status with the current "polite" model, since an
unprivileged application can set up for task isolation, and then
presumably no one can override it via sched_setaffinity() from another
task.  (I suppose you could do some kind of permissions-based thing
where root can always override it, or some suitable capability, etc.,
but I feel like that gets complicated quickly, for little benefit.)

The alternative you mention is that if the task is re-affinitized, it
loses its task-isolation status, and that also seems like an unfortunate
API, since if you are setting it with prctl(), it's really cleanest just to
only be able to unset it with prctl() as well.

I think given the current "polite" API, the only question is whether in
fact *no* initial test is the best thing, or if an initial test (as
introduced in the v8 version) is defensible just as a help for catching
an obvious mistake in setting up your task isolation.  I decided the
advantage of catching the mistake was more important than the "API
purity" of being 100% consistent in how we handled the interactions
between affinity and isolation, but I am certainly open to argument on
that one.

Meanwhile I think it still feels like the v8 code is the best compromise.

>> +	/* If the tick is running, request rescheduling; we're not ready. */
>> +	if (!tick_nohz_tick_stopped()) {
> Note that this function tells whether the tick is in dynticks mode, which means
> the tick currently only run on-demand. But it's not necessarily completely stopped.

I think in fact this is the semantics we want (and that people requested),
e.g. if the user requests an alarm(), we may still be ticking even though
tick_nohz_tick_stopped() is true, but that test is still the right condition
to use to return to user space, since the user explicitly requested the 
alarm.

> I think we should rename that function and the field it refers to.

Sounds like a good idea.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal
@ 2015-10-27 19:37                                 ` Chris Metcalf
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-10-27 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API,
	linux-kernel

On 10/22/2015 05:00 PM, Andy Lutomirski wrote:
> On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
>> On 10/21/2015 02:53 PM, Andy Lutomirski wrote:
>>> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com>
>>> wrote:
>>>>
>>>>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>>>>> Sent: Wednesday, October 21, 2015 4:43 AM
>>>>> To: Chris Metcalf
>>>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode
>>>>> configurable signal
>>>>>
>>>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com>
>>>>> wrote:
>>>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote:
>>>>>>> On Tue, 20 Oct 2015 16:36:04 -0400
>>>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>>>>>>>
>>>>>>>> Allow userspace to override the default SIGKILL delivered
>>>>>>>> when a task_isolation process in STRICT mode does a syscall
>>>>>>>> or otherwise synchronously enters the kernel.
>>>>>>>>
>>>> <snip>
>>>>>> It doesn't map SIGKILL to some other signal unconditionally.  It just
>>>>>> allows
>>>>>> the "hey, you broke the STRICT contract and entered the kernel" signal
>>>>>> to be something besides the default SIGKILL.
>>>>>>
>>>> <snip>
>>>>> I still dislike this thing.  It seems like a debugging feature being
>>>>> implemented using signals instead of existing APIs.  I *still* don't
>>>>> see why perf can't be used to accomplish your goal.
>>>>>
>>>> It is not (just) a debugging feature. There are workloads where not
>>>> performing an action is much preferred to being late.
>>>>
>>>> Consider the following artificial but representative scenario: a task
>>>> running in strict isolation is controlling a radiotherapy alpha emitter.
>>>> The code runs in a tight event loop, reading an MMIO register with
>>>> location data, making some calculation and in response writing an
>>>> MMIO register that triggers the alpha emitter. As a safety measure, each
>>>> trigger is for a specific very short time frame - the alpha emitter
>>>> auto stops.
>>>>
>>>> The code has a strict assumption that no more than X cycles pass between
>>>> reading the value and the response and the system is built in
>>>> such a way that as long as the code has mastery of the CPU the assumption
>>>> holds true. If something breaks this assumption (unplanned
>>>> context switch to kernel), what you want to do is just stop in place
>>>> rather than fire the alpha emitter X nanoseconds too late.
>>>>
>>>> This feature lets you say: if the "contract" of isolation is broken,
>>>> notify/kill me at once.
>>> That's a fair point.  It's risky, though, for quite a few reasons.
>>>
>>> 1. If someone builds an alpha emitter like this, they did it wrong.
>>> The kernel should write a trigger *and* a timestamp to the hardware
>>> and the hardware should trigger at the specified time if the time is
>>> in the future and throw an error if it's in the past.  If you need to
>>> check that you made the deadline, check the actual desired condition
>>> (did you meet the deadline?) not a proxy (did the signal fire?).
>>
>> Definitely a better hardware design, but as we all know, hardware
>> designers too rarely consult the software people who have to
>> write the actual code to properly drive the hardware :-)
>>
>> My canonical example is high-performance userspace network
>> drivers, and though dropping a packet is less likely to kill a
>> patient, it's still a pretty bad thing if you're trying to design
>> a robust appliance.  In this case you really want to fix application
>> bugs that cause the code to enter the kernel when you think
>> you're in the internal loop running purely in userspace.  Things
>> like unexpected page faults, and third-party code that almost
>> never calls the kernel but in some dusty corner it occasionally
>> does, can screw up your userspace code pretty badly, and
>> mysteriously.  The "strict" mode support is not a hypothetical
>> insurance policy but a reaction to lots of Tilera customer support
>> over the years for folks failing to stay in userspace when they
>> thought they were doing the right thing.
> But this is *exactly* the case where perf or other out-of-band
> debugging could be a much better solution.  Perf could notify a
> non-isolated thread that an interrupt happened, you'd still drop a
> packet or two, but you wouldn't also drop the next ten thousand
> packets while handling the signal.

There's no reason the signal needs to be delivered to one of
the nohz_full cores.  If you're setting up to catch these signals
rather than have them just SIGKILL you, then you want to
run a separate thread on a housekeeping core that is doing
a sigwait() or equivalent.  I'm not sure why using perf to do
this is particularly better; I'm most interested in ensuring that
it is easy for applications to set this up if they want it, and
perf isn't always super-easy to use.

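As a sketch of what I have in mind (assuming the application chose
SIGUSR1 as its strict-mode signal via the patch 06/14 prctl, and that
cpu 0 is a housekeeping core):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>

/* Watcher thread: pinned to a housekeeping core, synchronously waits
 * for the isolation-violation signal, so no isolated core ever has
 * to run a signal handler. */
static void *watcher(void *arg)
{
	cpu_set_t hk;
	sigset_t set;
	siginfo_t info;

	(void)arg;
	CPU_ZERO(&hk);
	CPU_SET(0, &hk);	/* cpu 0 assumed to be housekeeping */
	pthread_setaffinity_np(pthread_self(), sizeof(hk), &hk);

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	for (;;)
		if (sigwaitinfo(&set, &info) > 0)
			fprintf(stderr, "isolation broken (si_code %d)\n",
				info.si_code);
	return NULL;
}

int main(void)
{
	pthread_t t;
	sigset_t set;

	/* Block SIGUSR1 everywhere; only sigwaitinfo() consumes it. */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &set, NULL);
	pthread_create(&t, NULL, watcher, NULL);

	/* ... affinitize, enable strict isolation via prctl(), run ... */
	for (;;)
		;
}
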
That said, maybe it's easier than I think to do that specific
thing, and worth considering doing it that way instead.  Is there
an easily-explained way to do what you suggest where perf
delivers a signal?  I assume you have in mind creating a
synthetic sampling perf event and using perf_event_open()
to get a file descriptor for it, and waiting with poll or SIGIO?
(Too bad perf_event_open isn't supported by glibc and we
have to use syscall() to even call it.)  Seems complex...

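For the record, here is roughly what I understand the perf approach to
be - an untested sketch of a software event sampled on every occurrence,
with overflow delivered as a signal on the fd:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Untested sketch: deliver 'sig' to this process whenever the watched
 * task/cpu takes a context switch.  Error checking mostly omitted. */
static int perf_signal_fd(pid_t pid, int cpu, int sig)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
	attr.sample_period = 1;		/* overflow on every event */
	attr.disabled = 1;

	fd = syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
	if (fd < 0)
		return -1;

	/* Route overflow notifications to us as a signal. */
	fcntl(fd, F_SETFL, O_ASYNC);
	fcntl(fd, F_SETSIG, sig);
	fcntl(fd, F_SETOWN, getpid());
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	return fd;
}

Workable, perhaps, but it already feels like more moving parts than a
prctl bit plus an ordinary signal.
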
>>> 2. This strict mode thing isn't exhaustive.  It's missing, at least,
>>> coverage for NMI, MCE, and SMI.  Sure, you can think that you've
>>> disabled all NMI sources, you can try to remember to set the
>>> appropriate boot flag that panics on MCE (and hope that you don't get
>>> screwed by broadcast MCE on Intel systems before it got fixed
>>> (Skylake?  Is the fix even available in a released chip?)), and, for
>>> SMI, good luck...
>>
>> You are confusing this strict mode support with the debug
>> support in patch 07/14.
> Nope.  I'm confusing this strict mode with what Gilad described: using
> strict mode to cause outright shutdown instead of failure to meet a
> deadline.

Yeah, fair point.  We certainly could wire up a mode to deliver
a signal or whatever for asynchronous interrupts (which I'm claiming
are primarily kernel bugs) instead of just synchronous interrupts
(which I'm claiming are application bugs).  That could be an
additional mode bit for prctl(), e.g. PR_TASK_ISOLATION_DEBUG
to align with the task_isolation_debug boot variable that enables
the kernel printk spew.

> (FWIW, you could also use an ordinary hardware watchdog timer to
> promote your failure to meet a deadline to a shutdown.  No new kernel
> support needed.)

But that needs hardware support; there may not be a handy
hardware watchdog timer to use out of the box, and you don't
want to require the customer to buy new hardware to support
a feature like this if you don't have to.

>> Strict mode is for synchronous application errors.  You might
>> be right that there are cases that haven't been covered, but
>> certainly most of them are covered on the three platforms that
>> are supported in this initial series.  (You pointed me to one
>> that I would have missed on x86, namely the bounds check
>> exception from a bad bounds setup.)  I'm pretty confident I
>> have all of them for tile, since I know that hardware best,
>> and I think we're in good shape for arm64, though I'm still
>> coming up to speed on that architecture.
> Again, for this definition of strict mode, I still don't see why it's
> the right design.  If you want to debug your application to detect
> application errors, use a debugging interface.

Maybe.  But we basically want a single notification that the
app (and/or maybe kernel) screwed up.  Invoking all of perf
for that seems like overkill and a signal seems totally adequate,
whether for development fixing bugs, or production catching
bad things.  There are a reasonable number of precedents for
doing things this way: SIGPIPE and SIGFPE, to name two.


>> NMIs and machine checks are asynchronous interrupts that
>> don't have to do with what the application is doing, more or less.
>> Those should not be delivered to task-isolation cores at all,
>> so we just generate console spew when you set the
>> task_isolation_debug boot option.  I honestly don't know enough
>> about system management interrupts to comment on that,
>> though again, I would hope one can configure the system to
>> just not deliver them to nohz_full cores, and I think it would
>> be reasonable to generate some kernel spew if that happens.
> Hah hah yeah right.  On most existing Intel CPUs, you *cannot*
> configure machine checks to do anything other than broadcast to all
> cores or cause immediate shutdown.  And getting any sort of reasonable
> control over SMI more or less requires special firmware.

Yeah, as Gilad said, x86 may not be the best choice to run
a task-isolated application unless you can really set up those
kinds of things to stay off your core.

>>> 3. You haven't dealt with IPIs.  The TLB flush code in particular
>>> seems like it will break all your assumptions.
>>
>> Again, not a synchronous application error that we are trying
>> to catch with this signalling mechanism.
>>
>> That said it could obviously be a more general application error
>> (e.g. a process with threads on both nohz_full and housekeeping
>> cores, where the housekeeping core unmaps some memory and
>> thus requires a TLB flush IPI).  But this is covered by the
>> task_isolation_debug patch for kernel/smp.c.
>>
>>> Maybe it would make sense to whack more of the moles before adding a
>>> big assertion that there aren't any moles any more.
>>
>> Maybe, but I've whacked the ones I know how to whack.
>> If there are ones I've missed I'm happy to add them in a
>> subsequent version of this series, or in follow-on patches.
>>
> I agree that you can, in principle, catch all the synchronous
> application errors using this mechanism.  I'm saying that catching
> them seems quite useful, but catching them using a prctl that causes a
> signal and explicitly does *not* solve the deadline enforcement
> problem seems to have dubious value in the upstream kernel.

When you say "does not solve the deadline enforcement problem",
I'm not sure what point you're making.  The application presumably
can meet its own deadlines when it's not interrupted; the intent
here is to notice when the kernel gets in its way and notify it.

Granted you could add separate mechanisms to create deadlines
within the application, but that feels like a separate layer that
may or may not be desired for any given application.

> You can't catch the asynchronous application errors with this
> mechanism (or at least your ability to catch them depends on which
> patch version IIRC), which include calling anything like munmap or
> membarrier in another thread.

Yes, and munmap in another thread is certainly an application bug
at some level, so that's another reason to allow using the same
mechanism to notify the application of an asynchronous interrupt.
I'll add that for the next version of the patch series.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
@ 2016-01-28 16:38                       ` Frederic Weisbecker
  0 siblings, 0 replies; 340+ messages in thread
From: Frederic Weisbecker @ 2016-01-28 16:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote:
> On 10/21/2015 12:12 PM, Frederic Weisbecker wrote:
> >On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote:
> >>+/*
> >>+ * This routine controls whether we can enable task-isolation mode.
> >>+ * The task must be affinitized to a single nohz_full core or we will
> >>+ * return EINVAL.  Although the application could later re-affinitize
> >>+ * to a housekeeping core and lose task isolation semantics, this
> >>+ * initial test should catch 99% of bugs with task placement prior to
> >>+ * enabling task isolation.
> >>+ */
> >>+int task_isolation_set(unsigned int flags)
> >>+{
> >>+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> >I think you'll have to make sure the task can not be concurrently reaffined
> >to more CPUs. This may involve setting task_isolation_flags under the runqueue
> >lock and thus move that tiny part to the scheduler code. And then we must forbid
> >changing the affinity while the task has the isolation flag, or deactivate the flag.
> >
> >In any case this needs some synchronization.
> 
> Well, as the comment says, this is not intended as a hard guarantee.
> As written, it might race with a concurrent sched_setaffinity(), but
> then again, it also is totally OK as written for sched_setaffinity() to
> change it away after the prctl() is complete, so it's not necessary to
> do any explicit synchronization.
> 
> This harks back again to the whole "polite vs aggressive" issue with
> how we envision task isolation.
> 
> The "polite" model basically allows you to set up the conditions for
> task isolation to be useful, and then if they are useful, great! What
> you're suggesting here is a bit more of the "aggressive" model, where
> we actually fail sched_setaffinity() either for any cpumask after
> task isolation is set, or perhaps just for resetting it to housekeeping
> cores.  (Note that we could in principle use PF_NO_SETAFFINITY to
> just hard fail all attempts to call sched_setaffinity once we enable
> task isolation, so we don't have to add more mechanism on that path.)
> 
> I'm a little reluctant to ever fail sched_setaffinity() based on the
> task isolation status with the current "polite" model, since an
> unprivileged application can set up for task isolation, and then
> presumably no one can override it via sched_setaffinity() from another
> task.  (I suppose you could do some kind of permissions-based thing
> where root can always override it, or some suitable capability, etc.,
> but I feel like that gets complicated quickly, for little benefit.)
> 
> The alternative you mention is that if the task is re-affinitized, it
> loses its task-isolation status, and that also seems like an unfortunate
> API, since if you are setting it with prctl(), it's really cleanest just to
> only be able to unset it with prctl() as well.
> 
> I think given the current "polite" API, the only question is whether in
> fact *no* initial test is the best thing, or if an initial test (as
> introduced in the v8 version) is defensible just as a help for catching
> an obvious mistake in setting up your task isolation.  I decided the
> advantage of catching the mistake was more important than the "API
> purity" of being 100% consistent in how we handled the interactions
> between affinity and isolation, but I am certainly open to argument on
> that one.
> 
> Meanwhile I think it still feels like the v8 code is the best compromise.

So what is the way to deal with a migration for example? When the task wakes
up on the non-isolated CPU, it gets warned or killed?

> 
> >>+	/* If the tick is running, request rescheduling; we're not ready. */
> >>+	if (!tick_nohz_tick_stopped()) {
> >Note that this function tells whether the tick is in dynticks mode, which means
> >the tick currently only run on-demand. But it's not necessarily completely stopped.
> 
> I think in fact this is the semantics we want (and that people requested),
> e.g. if the user requests an alarm(), we may still be ticking even though
> tick_nohz_tick_stopped() is true, but that test is still the right condition
> to use to return to user space, since the user explicitly requested the
> alarm.

It seems to break the initial purpose. If your task really doesn't want to be
disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() really
indicates nothing more than the CPU trying to do its best to delay the next tick. But
that next tick could be re-armed every two msecs for example. Worse yet, if the
tick has been stopped and finally issues a timer that rearms itself every 1 msec,
tick_nohz_tick_stopped() will still be true.

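From userspace, for example, something as simple as the following keeps
a timer firing every millisecond, whatever the dynticks state was when
it was armed:

#include <sys/time.h>

/* Illustration: a self-rearming 1 msec interval timer. */
static void arm_1ms_timer(void)
{
	struct itimerval it = {
		.it_interval = { .tv_sec = 0, .tv_usec = 1000 },
		.it_value    = { .tv_sec = 0, .tv_usec = 1000 },
	};

	setitimer(ITIMER_REAL, &it, NULL);
}
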
Thanks.

> 
> >I think we should rename that function and the field it refers to.
> 
> Sounds like a good idea.
> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH v8 04/14] task_isolation: add initial support
  2016-01-28 16:38                       ` Frederic Weisbecker
@ 2016-02-11 19:58                         ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2016-02-11 19:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/28/2016 11:38 AM, Frederic Weisbecker wrote:
> On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote:
>> On 10/21/2015 12:12 PM, Frederic Weisbecker wrote:
>>> On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote:
>>>> +/*
>>>> + * This routine controls whether we can enable task-isolation mode.
>>>> + * The task must be affinitized to a single nohz_full core or we will
>>>> + * return EINVAL.  Although the application could later re-affinitize
>>>> + * to a housekeeping core and lose task isolation semantics, this
>>>> + * initial test should catch 99% of bugs with task placement prior to
>>>> + * enabling task isolation.
>>>> + */
>>>> +int task_isolation_set(unsigned int flags)
>>>> +{
>>>> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
>>> I think you'll have to make sure the task can not be concurrently reaffined
>>> to more CPUs. This may involve setting task_isolation_flags under the runqueue
>>> lock and thus move that tiny part to the scheduler code. And then we must forbid
>>> changing the affinity while the task has the isolation flag, or deactivate the flag.
>>>
>>> In any case this needs some synchronization.
>> Well, as the comment says, this is not intended as a hard guarantee.
>> As written, it might race with a concurrent sched_setaffinity(), but
>> then again, it also is totally OK as written for sched_setaffinity() to
>> change it away after the prctl() is complete, so it's not necessary to
>> do any explicit synchronization.
>>
>> This harks back again to the whole "polite vs aggressive" issue with
>> how we envision task isolation.
>>
>> The "polite" model basically allows you to set up the conditions for
>> task isolation to be useful, and then if they are useful, great! What
>> you're suggesting here is a bit more of the "aggressive" model, where
>> we actually fail sched_setaffinity() either for any cpumask after
>> task isolation is set, or perhaps just for resetting it to housekeeping
>> cores.  (Note that we could in principle use PF_NO_SETAFFINITY to
>> just hard fail all attempts to call sched_setaffinity once we enable
>> task isolation, so we don't have to add more mechanism on that path.)
>>
>> I'm a little reluctant to ever fail sched_setaffinity() based on the
>> task isolation status with the current "polite" model, since an
>> unprivileged application can set up for task isolation, and then
>> presumably no one can override it via sched_setaffinity() from another
>> task.  (I suppose you could do some kind of permissions-based thing
>> where root can always override it, or some suitable capability, etc.,
>> but I feel like that gets complicated quickly, for little benefit.)
>>
>> The alternative you mention is that if the task is re-affinitized, it
>> loses its task-isolation status, and that also seems like an unfortunate
>> API, since if you set it with prctl(), it's really cleanest if it can
>> only be unset with prctl() as well.
>>
>> I think given the current "polite" API, the only question is whether in
>> fact *no* initial test is the best thing, or if an initial test (as
>> introduced in the v8 version) is defensible just as a help for catching
>> an obvious mistake in setting up your task isolation.  I decided the
>> advantage of catching the mistake was more important than the "API
>> purity" of being 100% consistent in how we handled the interactions
>> between affinity and isolation, but I am certainly open to argument on
>> that one.
>>
>> Meanwhile I think it still feels like the v8 code is the best compromise.
> So what is the way to deal with a migration for example? When the task wakes
> up on the non-isolated CPU, it gets warned or killed?

Good question!  We can only enable task isolation on an isolcpus core,
so any migration must be a manual one, either from outside, or by the
program itself calling sched_setaffinity().  So at some level, it's just
an application bug.  In the current code, if you have enabled STRICT mode
task isolation, the process will get killed, since it has to go through
the kernel to migrate.  If not in STRICT mode, it will hang until it is
manually killed, since full dynticks will never get turned on once it
wakes up on a non-isolated CPU - unless it is then manually migrated back
to a proper task-isolation cpu.  And perhaps the intent really was to do
some cpu offlining and rearrange the task-isolation tasks, in which case
that behavior makes sense.

So maybe those semantics are good enough!?  I'm not completely sure, but
I'm willing to claim that for something this much of a corner case,
they're probably reasonable.
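
For concreteness, the setup we expect from an application is something
like the minimal sketch below.  The PR_SET_TASK_ISOLATION and
PR_TASK_ISOLATION_ENABLE names are from this patch series, not mainline
headers, and the numeric value is only a placeholder; cpu 3 is assumed
to be listed in nohz_full= and isolcpus=:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48	/* placeholder value from this series */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Pin to exactly one nohz_full/isolcpus core first... */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ...then request isolation; the prctl() fails with EINVAL
	 * if the task is still runnable on more than one cpu. */
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		  0, 0, 0) != 0) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}

	/* Any later sched_setaffinity() back to a housekeeping core
	 * is exactly the application bug discussed above. */
	for (;;)
		;	/* drive packets from userspace, never enter the kernel */
}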

>>>> +	/* If the tick is running, request rescheduling; we're not ready. */
>>>> +	if (!tick_nohz_tick_stopped()) {
>>> Note that this function tells whether the tick is in dynticks mode, which means
>>> the tick currently only runs on-demand. But it's not necessarily completely stopped.
>> I think in fact this is the semantics we want (and that people requested),
>> e.g. if the user requests an alarm(), we may still be ticking even though
>> tick_nohz_tick_stopped() is true, but that test is still the right condition
>> to use to return to user space, since the user explicitly requested the
>> alarm.
> It seems to break the initial purpose. If your task really doesn't want to be
> disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no
> more than an indication that the CPU is doing its best to delay the next tick. But
> that next tick could be re-armed every two msecs, for example. Worse yet, if the
> tick has been stopped and then issues a timer that rearms itself every 1 msec,
> tick_nohz_tick_stopped() will still be true.

This is definitely another grey area.  Certainly if there's a kernel timer that
rearms itself every 1 ms, we're in trouble.  (And the existing mechanisms of STRICT
mode and task_isolation_debug would help there.)  But as for regular userspace
arming a timer via syscall: if your hardware had a 64-bit down counter for timer
interrupts, for example, you might well be able to say "every night at midnight,
I can stop driving packets and do system maintenance, so I'd like the kernel to
interrupt me then".  In this case some kind of alarm() would not be incompatible
with task isolation.  I admit this is kind of an extreme case; and certainly in
STRICT mode, as currently written, you'd get a signal if you tried to do this,
so you'd have to run with STRICT mode off.
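
To make that concrete: the exit-to-userspace test only waits for the
tick to go into dynticks mode, so a pending user-armed alarm does not
deadlock the return.  Very roughly - task_isolation_enter() is the name
from the patches, but the body here is a simplified schematic reduced
to the one check under discussion, not the actual patch code:

/*
 * Schematic only: loop on the exit-to-usermode path until the
 * dynticks code reports the tick is stopped.  Note that
 * tick_nohz_tick_stopped() returning true does not mean no timer
 * will ever fire again -- e.g. a user-requested alarm() is still
 * allowed to interrupt us later.
 */
void task_isolation_enter(void)
{
	while (!tick_nohz_tick_stopped()) {
		/* Not ready yet; let other work (timers, softirqs)
		 * drain, then re-test. */
		schedule();
	}
}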

However, the reason I specifically decided to do it this way was community
feedback.  In
http://lkml.kernel.org/r/CALCETrVdZxkEeQd3=V6p_yLYL7T83Y3WfnhfVGi3GwTxF+vPQg@mail.gmail.com,
on 9/28/2015, Andy Lutomirski wrote:

> Why are we treating alarms as something that should defer entry to
> userspace?  I think it would be entirely reasonable to set an alarm
> for ten minutes, ask for isolation, and then think hard for ten
> minutes.
>
> [...]
>
> ISTM something's suboptimal with the inner workings of all this if
> task_isolation_enter needs to sleep to wait for an event that isn't
> scheduled for the immediate future (e.g. already queued up as an
> interrupt).
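
Andy's scenario maps onto the API as something like the sketch below
(again assuming the prctl constants from this series, as in the earlier
example, and with STRICT mode left off so the alarm is allowed to
interrupt the task):

#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48	/* placeholder, as above */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

static volatile sig_atomic_t expired;

static void on_alarm(int sig)
{
	(void)sig;
	expired = 1;
}

/* Assumes the task is already affinitized to a nohz_full core. */
void think_hard(void)
{
	signal(SIGALRM, on_alarm);
	alarm(600);					/* ten minutes */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

	while (!expired)
		;					/* compute, no kernel entries */

	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);	/* drop isolation */
}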

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 340+ messages in thread

end of thread, other threads:[~2016-02-11 20:13 UTC | newest]

Thread overview: 340+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08 17:58 [PATCH 0/6] support "dataplane" mode for nohz_full Chris Metcalf
2015-05-08 17:58 ` Chris Metcalf
2015-05-08 17:58 ` [PATCH 1/6] nohz_full: add support for "dataplane" mode Chris Metcalf
2015-05-08 17:58   ` Chris Metcalf
2015-05-08 17:58 ` [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane Chris Metcalf
2015-05-12  9:26   ` Peter Zijlstra
2015-05-12 13:12     ` Paul E. McKenney
2015-05-14 20:55       ` Chris Metcalf
2015-05-08 17:58 ` [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry Chris Metcalf
2015-05-09  7:04   ` Mike Galbraith
2015-05-11 20:13     ` Chris Metcalf
2015-05-12  2:21       ` Mike Galbraith
2015-05-12  9:28       ` Peter Zijlstra
2015-05-12  9:32       ` Peter Zijlstra
2015-05-12 13:08         ` Paul E. McKenney
2015-05-08 17:58 ` [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE Chris Metcalf
2015-05-08 17:58   ` Chris Metcalf
2015-05-12  9:33   ` Peter Zijlstra
2015-05-12  9:33     ` Peter Zijlstra
2015-05-12  9:50     ` Ingo Molnar
2015-05-12  9:50       ` Ingo Molnar
2015-05-12 10:38       ` Peter Zijlstra
2015-05-12 10:38         ` Peter Zijlstra
2015-05-12 12:52         ` Ingo Molnar
2015-05-13  4:35           ` Andy Lutomirski
2015-05-13 17:51             ` Paul E. McKenney
2015-05-14 20:55               ` Chris Metcalf
2015-05-14 20:55                 ` Chris Metcalf
2015-05-14 20:54     ` Chris Metcalf
2015-05-14 20:54       ` Chris Metcalf
2015-05-08 17:58 ` [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode Chris Metcalf
2015-05-08 17:58   ` Chris Metcalf
2015-05-09  7:28   ` Andy Lutomirski
2015-05-09 10:37     ` Gilad Ben Yossef
2015-05-09 10:37       ` Gilad Ben Yossef
2015-05-11 19:13     ` Chris Metcalf
2015-05-11 19:13       ` Chris Metcalf
2015-05-11 22:28       ` Andy Lutomirski
2015-05-12 21:06         ` Chris Metcalf
2015-05-12 22:23           ` Andy Lutomirski
2015-05-15 21:25             ` Chris Metcalf
2015-05-12  9:38   ` Peter Zijlstra
2015-05-12 13:20     ` Paul E. McKenney
2015-05-12 13:20       ` Paul E. McKenney
2015-05-08 17:58 ` [PATCH 6/6] nohz: add dataplane_debug boot flag Chris Metcalf
2015-05-08 21:18 ` [PATCH 0/6] support "dataplane" mode for nohz_full Andrew Morton
2015-05-08 21:18   ` Andrew Morton
2015-05-08 21:22   ` Steven Rostedt
2015-05-08 21:22     ` Steven Rostedt
2015-05-08 23:11     ` Chris Metcalf
2015-05-08 23:11       ` Chris Metcalf
2015-05-08 23:19       ` Andrew Morton
2015-05-08 23:19         ` Andrew Morton
2015-05-09  7:05         ` Ingo Molnar
2015-05-09  7:19           ` Andy Lutomirski
2015-05-09  7:19             ` Andy Lutomirski
2015-05-11 19:54             ` Chris Metcalf
2015-05-11 19:54               ` Chris Metcalf
2015-05-11 22:15               ` Andy Lutomirski
2015-05-11 22:15                 ` Andy Lutomirski
     [not found]             ` <55510885.9070101@ezchip.com>
2015-05-12 13:18               ` Paul E. McKenney
2015-05-09  7:19           ` Mike Galbraith
2015-05-09  7:19             ` Mike Galbraith
2015-05-09 10:18             ` Gilad Ben Yossef
2015-05-09 10:18               ` Gilad Ben Yossef
2015-05-11 12:57           ` Steven Rostedt
2015-05-11 12:57             ` Steven Rostedt
2015-05-11 15:36             ` Frederic Weisbecker
2015-05-11 19:19               ` Mike Galbraith
2015-05-11 19:25                 ` Chris Metcalf
2015-05-11 19:25                   ` Chris Metcalf
2015-05-12  1:47                   ` Mike Galbraith
2015-05-12  4:35                     ` Mike Galbraith
2015-05-15 15:05                     ` Chris Metcalf
2015-05-15 18:44                       ` Mike Galbraith
2015-05-26 19:51                         ` Chris Metcalf
2015-05-27  3:28                           ` Mike Galbraith
2015-05-11 17:19             ` Paul E. McKenney
2015-05-11 17:27               ` Andrew Morton
2015-05-11 17:33                 ` Frederic Weisbecker
2015-05-11 18:00                   ` Steven Rostedt
2015-05-11 18:09                     ` Chris Metcalf
2015-05-11 18:09                       ` Chris Metcalf
2015-05-11 18:36                       ` Steven Rostedt
2015-05-11 18:36                         ` Steven Rostedt
2015-05-12  9:10                       ` CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) Ingo Molnar
2015-05-12 11:48                         ` Peter Zijlstra
2015-05-12 11:48                           ` Peter Zijlstra
2015-05-12 12:34                           ` Ingo Molnar
2015-05-12 12:39                             ` Peter Zijlstra
2015-05-12 12:39                               ` Peter Zijlstra
2015-05-12 12:43                               ` Ingo Molnar
2015-05-12 12:43                                 ` Ingo Molnar
2015-05-12 15:36                             ` Frederic Weisbecker
2015-05-12 21:05                         ` CONFIG_ISOLATION=y Chris Metcalf
2015-05-12 21:05                           ` CONFIG_ISOLATION=y Chris Metcalf
2015-05-12 10:46             ` [PATCH 0/6] support "dataplane" mode for nohz_full Peter Zijlstra
2015-05-12 10:46               ` Peter Zijlstra
2015-05-15 15:10               ` Chris Metcalf
2015-05-15 15:10                 ` Chris Metcalf
2015-05-15 21:26 ` [PATCH v2 0/5] support "cpu_isolated" " Chris Metcalf
2015-05-15 21:26   ` Chris Metcalf
2015-05-15 21:27   ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf
2015-05-15 21:27     ` Chris Metcalf
2015-05-15 21:27     ` [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf
2015-05-15 21:27       ` Chris Metcalf
2015-05-15 21:27     ` [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf
2015-05-15 21:27       ` Chris Metcalf
2015-05-15 21:27     ` [PATCH v2 4/5] nohz: add cpu_isolated_debug boot flag Chris Metcalf
2015-05-15 21:27     ` [PATCH v2 5/5] nohz: cpu_isolated: allow tick to be fully disabled Chris Metcalf
2015-05-15 22:17     ` [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode Thomas Gleixner
2015-05-15 22:17       ` Thomas Gleixner
2015-05-28 20:38       ` Chris Metcalf
2015-05-28 20:38         ` Chris Metcalf
2015-06-03 15:29   ` [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf
2015-06-03 15:29     ` Chris Metcalf
2015-06-03 15:29     ` [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf
2015-06-03 15:29       ` Chris Metcalf
2015-06-03 15:29     ` [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf
2015-06-03 15:29       ` Chris Metcalf
2015-06-03 15:29     ` [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf
2015-06-03 15:29       ` Chris Metcalf
2015-06-03 15:29     ` [PATCH v3 4/5] nohz: add cpu_isolated_debug boot flag Chris Metcalf
2015-06-03 15:29     ` [PATCH v3 5/5] nohz: cpu_isolated: allow tick to be fully disabled Chris Metcalf
2015-07-13 19:57     ` [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full Chris Metcalf
2015-07-13 19:57       ` Chris Metcalf
2015-07-13 19:57       ` [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode Chris Metcalf
2015-07-13 19:57         ` Chris Metcalf
2015-07-13 20:40         ` Andy Lutomirski
2015-07-13 21:01           ` Chris Metcalf
2015-07-13 21:45             ` Andy Lutomirski
2015-07-21 19:10               ` Chris Metcalf
2015-07-21 19:26                 ` Andy Lutomirski
2015-07-21 20:36                   ` Paul E. McKenney
2015-07-21 20:36                     ` Paul E. McKenney
2015-07-22 13:57                     ` Christoph Lameter
2015-07-22 19:28                       ` Paul E. McKenney
2015-07-22 19:28                         ` Paul E. McKenney
2015-07-22 20:02                         ` Christoph Lameter
2015-07-24 20:21                           ` Chris Metcalf
2015-07-24 20:22                   ` Chris Metcalf
2015-07-24 14:03                 ` Frederic Weisbecker
2015-07-24 20:19                   ` Chris Metcalf
2015-07-24 13:27         ` Frederic Weisbecker
2015-07-24 20:21           ` Chris Metcalf
2015-07-24 20:21             ` Chris Metcalf
2015-07-13 19:57       ` [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf
2015-07-13 19:57         ` Chris Metcalf
2015-07-13 21:47         ` Andy Lutomirski
2015-07-13 21:47           ` Andy Lutomirski
2015-07-21 19:34           ` Chris Metcalf
2015-07-21 19:34             ` Chris Metcalf
2015-07-21 19:42             ` Andy Lutomirski
2015-07-21 19:42               ` Andy Lutomirski
2015-07-24 20:29               ` Chris Metcalf
2015-07-13 19:57       ` [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal Chris Metcalf
2015-07-13 19:57         ` Chris Metcalf
2015-07-13 19:58       ` [PATCH v4 4/5] nohz: add cpu_isolated_debug boot flag Chris Metcalf
2015-07-13 19:58       ` [PATCH v4 5/5] nohz: cpu_isolated: allow tick to be fully disabled Chris Metcalf
2015-07-28 19:49       ` [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full Chris Metcalf
2015-07-28 19:49         ` Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 1/6] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 2/6] cpu_isolated: add initial support Chris Metcalf
2015-07-28 19:49           ` Chris Metcalf
2015-08-12 16:00           ` Frederic Weisbecker
2015-08-12 16:00             ` Frederic Weisbecker
2015-08-12 18:22             ` Chris Metcalf
2015-08-12 18:22               ` Chris Metcalf
2015-08-26 15:26               ` Frederic Weisbecker
2015-08-26 15:26                 ` Frederic Weisbecker
2015-08-26 15:55                 ` Chris Metcalf
2015-08-26 15:55                   ` Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode Chris Metcalf
2015-07-28 19:49           ` Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal Chris Metcalf
2015-07-28 19:49           ` Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 5/6] cpu_isolated: add debug boot flag Chris Metcalf
2015-07-28 19:49         ` [PATCH v5 6/6] nohz: cpu_isolated: allow tick to be fully disabled Chris Metcalf
2015-08-25 19:55         ` [PATCH v6 0/6] support "task_isolated" mode for nohz_full Chris Metcalf
2015-08-25 19:55           ` Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 1/6] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 2/6] task_isolation: add initial support Chris Metcalf
2015-08-25 19:55             ` Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2015-08-25 19:55             ` Chris Metcalf
2015-08-26 10:36             ` Will Deacon
2015-08-26 15:10               ` Chris Metcalf
2015-09-02 10:13                 ` Will Deacon
2015-09-02 10:13                   ` Will Deacon
2015-08-28 15:31               ` [PATCH v6.1 " Chris Metcalf
2015-08-28 15:31                 ` Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 4/6] task_isolation: provide strict mode configurable signal Chris Metcalf
2015-08-25 19:55             ` Chris Metcalf
2015-08-28 19:22             ` Andy Lutomirski
2015-09-02 18:38               ` [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2015-09-02 18:38                 ` Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 5/6] task_isolation: add debug boot flag Chris Metcalf
2015-08-25 19:55           ` [PATCH v6 6/6] nohz: task_isolation: allow tick to be fully disabled Chris Metcalf
2015-09-28 15:17           ` [PATCH v7 00/11] support "task_isolated" mode for nohz_full Chris Metcalf
2015-09-28 15:17             ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 01/11] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 02/11] task_isolation: add initial support Chris Metcalf
2015-09-28 15:17               ` Chris Metcalf
2015-10-01 12:14               ` Frederic Weisbecker
2015-10-01 12:18                 ` Thomas Gleixner
2015-10-01 12:23                   ` Frederic Weisbecker
2015-10-01 12:31                     ` Thomas Gleixner
2015-10-01 17:02                   ` Chris Metcalf
2015-10-01 17:02                     ` Chris Metcalf
2015-10-01 21:20                     ` Thomas Gleixner
2015-10-01 21:20                       ` Thomas Gleixner
2015-10-02 17:15                       ` Chris Metcalf
2015-10-02 17:15                         ` Chris Metcalf
2015-10-02 19:02                         ` Thomas Gleixner
2015-10-02 19:02                           ` Thomas Gleixner
2015-10-01 19:25                 ` Chris Metcalf
2015-10-01 19:25                   ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2015-09-28 15:17               ` Chris Metcalf
2015-09-28 20:51               ` Andy Lutomirski
2015-09-28 20:51                 ` Andy Lutomirski
2015-09-28 21:54                 ` Chris Metcalf
2015-09-28 22:38                   ` Andy Lutomirski
2015-09-29 17:35                     ` Chris Metcalf
2015-09-29 17:46                       ` Andy Lutomirski
2015-09-29 17:46                         ` Andy Lutomirski
2015-09-29 17:57                         ` Chris Metcalf
2015-09-29 17:57                           ` Chris Metcalf
2015-09-29 18:00                           ` Andy Lutomirski
2015-10-01 19:25                             ` Chris Metcalf
2015-10-01 19:25                               ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 04/11] task_isolation: provide strict mode configurable signal Chris Metcalf
2015-09-28 15:17               ` Chris Metcalf
2015-09-28 20:54               ` Andy Lutomirski
2015-09-28 21:54                 ` Chris Metcalf
2015-09-28 21:54                   ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 05/11] task_isolation: add debug boot flag Chris Metcalf
2015-09-28 20:59               ` Andy Lutomirski
2015-09-28 21:55                 ` Chris Metcalf
2015-09-28 22:40                   ` Andy Lutomirski
2015-09-29 17:35                     ` Chris Metcalf
2015-10-05 17:07               ` Luiz Capitulino
2015-10-08  0:33                 ` Chris Metcalf
2015-10-08 20:28                   ` Luiz Capitulino
2015-09-28 15:17             ` [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled Chris Metcalf
2015-09-28 20:40               ` Andy Lutomirski
2015-10-01 13:07                 ` Frederic Weisbecker
2015-10-01 14:13                   ` Thomas Gleixner
2015-09-28 15:17             ` [PATCH v7 07/11] arch/x86: enable task isolation functionality Chris Metcalf
2015-09-28 20:59               ` Andy Lutomirski
2015-09-28 21:57                 ` Chris Metcalf
2015-09-28 22:43                   ` Andy Lutomirski
2015-09-29 17:42                     ` Chris Metcalf
2015-09-29 17:57                       ` Andy Lutomirski
2015-09-30 20:25                         ` Thomas Gleixner
2015-09-30 20:58                           ` Chris Metcalf
2015-09-30 22:02                             ` Thomas Gleixner
2015-09-30 22:11                               ` Andy Lutomirski
2015-10-01  8:12                                 ` Thomas Gleixner
2015-10-01  9:08                                   ` Christoph Lameter
2015-10-01 10:10                                     ` Thomas Gleixner
2015-10-01 19:25                                   ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 08/11] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2015-09-28 15:17               ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 09/11] arch/arm64: enable task isolation functionality Chris Metcalf
2015-09-28 15:17               ` Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 10/11] arch/tile: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2015-09-28 15:17             ` [PATCH v7 11/11] arch/tile: enable task isolation functionality Chris Metcalf
2015-10-20 20:35             ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Chris Metcalf
2015-10-20 20:35               ` Chris Metcalf
2015-10-20 20:35               ` [PATCH v8 01/14] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 02/14] vmstat: add vmstat_idle function Chris Metcalf
2015-10-20 20:45                 ` Christoph Lameter
2015-10-20 20:36               ` [PATCH v8 03/14] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 04/14] task_isolation: add initial support Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-20 20:56                 ` Andy Lutomirski
2015-10-20 21:20                   ` Chris Metcalf
2015-10-20 21:20                     ` Chris Metcalf
2015-10-20 21:26                     ` Andy Lutomirski
2015-10-20 21:26                       ` Andy Lutomirski
2015-10-21  0:29                       ` Steven Rostedt
2015-10-21  0:29                         ` Steven Rostedt
2015-10-26 20:19                         ` Chris Metcalf
2015-10-26 21:13                         ` Chris Metcalf
2015-10-26 20:32                       ` Chris Metcalf
2015-10-21 16:12                 ` Frederic Weisbecker
2015-10-21 16:12                   ` Frederic Weisbecker
2015-10-27 16:40                   ` Chris Metcalf
2015-10-27 16:40                     ` Chris Metcalf
2016-01-28 16:38                     ` Frederic Weisbecker
2016-01-28 16:38                       ` Frederic Weisbecker
2016-02-11 19:58                       ` Chris Metcalf
2016-02-11 19:58                         ` Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 06/14] task_isolation: provide strict mode configurable signal Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-21  0:56                 ` Steven Rostedt
2015-10-21  0:56                   ` Steven Rostedt
2015-10-21  1:30                   ` Chris Metcalf
2015-10-21  1:30                     ` Chris Metcalf
2015-10-21  1:41                     ` Steven Rostedt
2015-10-21  1:41                       ` Steven Rostedt
2015-10-21  1:42                     ` Andy Lutomirski
2015-10-21  6:41                       ` Gilad Ben Yossef
2015-10-21  6:41                         ` Gilad Ben Yossef
2015-10-21 18:53                         ` Andy Lutomirski
2015-10-22 20:44                           ` Chris Metcalf
2015-10-22 21:00                             ` Andy Lutomirski
2015-10-27 19:37                               ` Chris Metcalf
2015-10-27 19:37                                 ` Chris Metcalf
2015-10-24  9:16                           ` Gilad Ben Yossef
2015-10-24  9:16                             ` Gilad Ben Yossef
2015-10-20 20:36               ` [PATCH v8 07/14] task_isolation: add debug boot flag Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot Chris Metcalf
2015-10-20 21:03                 ` Frederic Weisbecker
2015-10-20 21:18                   ` Chris Metcalf
2015-10-21  0:59                     ` Steven Rostedt
2015-10-21  6:56                   ` Gilad Ben Yossef
2015-10-21 14:28                   ` Christoph Lameter
2015-10-21 15:35                     ` Frederic Weisbecker
2015-10-20 20:36               ` [PATCH v8 09/14] arch/x86: enable task isolation functionality Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 10/14] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 11/14] arch/arm64: enable task isolation functionality Chris Metcalf
2015-10-20 20:36                 ` Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 12/14] arch/tile: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 13/14] arch/tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2015-10-20 20:36               ` [PATCH v8 14/14] arch/tile: enable task isolation functionality Chris Metcalf
2015-10-21 12:39               ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Peter Zijlstra
2015-10-22 20:31                 ` Chris Metcalf
2015-10-22 20:31                   ` Chris Metcalf
2015-10-23  2:33                   ` Frederic Weisbecker
2015-10-23  8:49                     ` Peter Zijlstra
2015-10-23 13:29                       ` Frederic Weisbecker
2015-10-23  9:04                   ` Peter Zijlstra
2015-10-23  9:04                     ` Peter Zijlstra
2015-10-23 11:52                     ` Theodore Ts'o
