* [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

It has been a couple of months since the v8 version of this patch
series, as various other priorities came up at work.  Given the
delay, I will try to summarize where I think we got to on the
various issues that were raised with v8.

1. Andy Lutomirski raised the issue of whether it really made sense to
   only attempt to set up the conditions for task isolation, ask the kernel
   nicely for it, and then wait until it happened.  He wondered if a
   SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
   also suggested having an interface that would force everything else
   off a core to enable SCHED_ISOLATED to succeed.  Frederic added
   some concerns about enforcing the test that the process was in a
   good state to enter task isolation.

   I tried to address the different design philosophies for what I called
   the original "polite" mode and the reviewers' suggestions for an
   "aggressive" mode in this email:

   https://lkml.org/lkml/2015/10/26/625

   As I said there, on balance I think the "polite" option is still
   better.  Obviously folks are welcome to disagree and I'm happy to
   continue that conversation (or perhaps I convinced everyone).

2. Andy didn't like the idea of having a "STRICT" mode which
   delivered a signal to a process for violating the contract that it
   will promise to stay out of the kernel.  Gilad Ben Yossef argued that
   it made sense to have a way for the kernel to enforce the requested
   correctness guarantee of never being interrupted.  Andy pointed out
   that we should then really deliver such a signal when the kernel
   delivers an asynchronous interrupt to the core as well.  In particular
   this is a concern for the application-error case of a process that
   calls munmap() on one core while a thread on another core is running
   STRICT, and thus gets an unexpected TLB flush.

   This patch series addresses that concern by including support for
   IRQs, IPIs, and similar asynchronous interrupts to also send the
   STRICT signal to the process.  We don't try to send the signal if
   we are in an NMI, and instead just force a console backtrace like
   you would get in task_isolation_debug mode.

3. Frederic nacked my patch for a boot flag to disable the 1Hz
   periodic scheduler tick.

   I'm still hoping he's open to changing his mind about that, but in
   this patch series I have removed that boot flag.

Various other changes have been introduced since v8:

https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com

- Rebased to Linux 4.4-rc5.

- Since nohz_full and isolcpus have been separated back out again in
  4.4, I introduced a new task_isolation=MASK boot argument that sets
  both of them.  The task isolation support now requires that this
  boot flag have been used; it intentionally doesn't work if you've
  just enabled nohz_full and isolcpus separately.  I could be
  convinced that doing it the other way around makes sense, though.

- I folded the two STRICT mode patches together since there didn't
  seem to be much value in having the second patch that just enabled
  having a settable signal.  I also refactored the various routines
  that report on interrupts/exceptions/etc to make it easier to hook
  in from the case where we are interrupted asynchronously.

- For the debug support, I moved most of the functionality into
  kernel/isolation.c and out of kernel/sched/core.c, leaving only a
  small hook to handle mapping a remote cpu to a task struct safely.
  In addition to implementing Andy's suggestion of signalling a task
  when it is interrupted asynchronously, I also added a ratelimit
  hook so we won't spam the console if (for example) a timer interrupt
  runs amok - particularly since when this happens without ratelimit,
  it can end up self-perpetuating the timer interrupt.

- I added a task_isolation_debug_cpumask() helper function to check
  all the cpus in a mask to see if they are being interrupted
  inappropriately.

- I made the check for irq_enter() robust to architectures that
  have already entered user-mode context tracking before calling
  irq_enter(), by testing user_mode(get_irq_regs()) instead of
  context_tracking_in_user(), and split the code out into a
  separate inlined function so I could comment it better (see the
  sketch after this list).

- For arm64, I added a task_isolation_debug_cpumask() hook for
  smp_cross_call(), which I had missed in the earlier versions.

- I generalized the fix for tile to set up a clockevents hook for
  set_state_oneshot_stopped() to also apply to the arm_arch_timer,
  which I realized was showing the same problem.  For both cases,
  this seems to be what Viresh had in mind with commit 8fff52fd509345
  ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

- For tile, I adopted the arm model of doing user_exit() calls in the
  early assembly code (a new patch in this series).  I also added a
  missing task_isolation_debug hook for tile's IPI and remote cache
  flush code.
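
To make the irq_enter() change above concrete, here is a rough sketch
of the shape of that check (illustrative only: the helper name is
hypothetical, and the real patch feeds into the task_isolation_debug
reporting and ratelimiting rather than this bare call):

	/*
	 * Sketch: called from irq_enter().  Testing the saved pt_regs
	 * works even on architectures that have already switched
	 * context tracking to "kernel" before calling irq_enter().
	 */
	static inline void task_isolation_debug_irq_enter(void)
	{
		struct pt_regs *regs = get_irq_regs();

		if (regs && user_mode(regs))
			task_isolation_debug(smp_processor_id());
	}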

Chris Metcalf (12):
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: move user_exit() to early kernel entry sequence
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |  16 +++
 arch/arm64/include/asm/thread_info.h |  18 ++-
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  12 +-
 arch/arm64/kernel/signal.c           |  35 ++++--
 arch/arm64/kernel/smp.c              |   2 +
 arch/arm64/mm/fault.c                |   4 +
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 +-
 arch/tile/kernel/intvec_32.S         |  51 +++-----
 arch/tile/kernel/intvec_64.S         |  54 +++------
 arch/tile/kernel/process.c           |  83 +++++++------
 arch/tile/kernel/ptrace.c            |  19 +--
 arch/tile/kernel/single_step.c       |   8 +-
 arch/tile/kernel/smp.c               |  26 ++--
 arch/tile/kernel/time.c              |   1 +
 arch/tile/kernel/traps.c             |  13 +-
 arch/tile/kernel/unaligned.c         |  16 ++-
 arch/tile/mm/fault.c                 |   6 +-
 arch/tile/mm/homecache.c             |   2 +
 arch/x86/entry/common.c              |  10 +-
 arch/x86/kernel/traps.c              |   2 +
 arch/x86/mm/fault.c                  |   2 +
 drivers/clocksource/arm_arch_timer.c |   2 +
 include/linux/isolation.h            |  80 +++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/swap.h                 |   1 +
 include/linux/vmstat.h               |   4 +
 include/uapi/linux/prctl.h           |   8 ++
 init/Kconfig                         |  20 ++++
 kernel/Makefile                      |   1 +
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  18 +++
 kernel/signal.c                      |   5 +
 kernel/smp.c                         |   6 +-
 kernel/softirq.c                     |  33 +++++
 kernel/sys.c                         |   9 ++
 mm/swap.c                            |  13 +-
 mm/vmstat.c                          |  24 ++++
 40 files changed, 665 insertions(+), 188 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.1.2


* [PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
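
For context, the task-isolation code later in this series calls this
on the return-to-userspace path; from _task_isolation_enter() in the
"add initial support" patch:

	/* Quieten the vmstat worker so it won't interrupt us. */
	quiet_vmstat();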

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 5dbc8b0ee567..6f5a21993ff3 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -189,6 +189,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -249,6 +250,7 @@ static inline void __dec_zone_page_state(struct page *page,
 
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0d5712b0206c..0510d2ec31a6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1418,6 +1418,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2



* [PATCH v9 02/13] vmstat: add vmstat_idle function
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running on the
current core and that the vmstat diffs don't require an update.  It
is called from the task-isolation code to determine whether any
work is actually needed to quiet vmstat.
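
Concretely, the task-isolation core in this series uses it as one of
the readiness checks before returning to userspace; from
_task_isolation_ready():

	/* If vmstats need updating, we're not ready. */
	if (!vmstat_idle())
		return false;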

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6f5a21993ff3..3dc82bf5bce6 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,6 +190,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -251,6 +252,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0510d2ec31a6..ccc390197464 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1454,6 +1454,16 @@ static bool need_update(int cpu)
 	return false;
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	int cpu = smp_processor_id();
+	return cpumask_test_cpu(cpu, cpu_stat_off) && !need_update(cpu);
+}
+
 
 /*
  * Shepherd worker thread that checks the
-- 
2.1.2



* [PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was previously done inline in the loop in
lru_add_drain_all(); factoring it out so it can be called for a
particular cpu is helpful for the task-isolation patches.
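
Besides replacing the open-coded test in lru_add_drain_all() below,
the task-isolation core calls it for the current cpu; from
_task_isolation_ready():

	/* If we need to drain the LRU cache, we're not ready. */
	if (lru_add_drain_needed(smp_processor_id()))
		return false;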

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 13 +++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..66719610c9f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,6 +305,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 39395fb549c0..ce1eb052a293 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,6 +854,14 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -880,10 +888,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.1.2



* [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
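
As a usage illustration, a minimal userspace sketch (hypothetical
code, not part of this series; it assumes the new constants are
visible via <sys/prctl.h>, needs _GNU_SOURCE and <sched.h> for the
affinity calls, and reflects the requirement that the task first be
affinitized to a single task_isolation cpu):

	/*
	 * Pin to one isolated cpu (here cpu 3, assuming boot with
	 * task_isolation=3), then request task isolation.
	 */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		err(1, "sched_setaffinity");
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
		err(1, "prctl");	/* -EINVAL if placement is wrong */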

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl().  When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future. 

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify
the scheduler to attempt to schedule a different task.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).
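
Schematically, an architecture's exit path then looks something like
this (a sketch only; ti_work_pending() stands in for the arch's usual
TIF work-flags test, and the per-arch patches later in the series
wire this into the real prepare_exit_to_usermode() equivalents):

	/* Return-to-userspace work loop, irqs disabled at the test. */
	while (ti_work_pending() || !task_isolation_ready()) {
		local_irq_enable();
		/* ... handle TIF_NEED_RESCHED, TIF_SIGPENDING, etc. ... */
		task_isolation_enter();	/* quiesce what we can */
		local_irq_disable();
	}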

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |   8 +++
 include/linux/isolation.h           |  50 +++++++++++++++++
 include/linux/sched.h               |   3 ++
 include/uapi/linux/prctl.h          |   5 ++
 init/Kconfig                        |  20 +++++++
 kernel/Makefile                     |   1 +
 kernel/isolation.c                  | 105 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   9 ++++
 8 files changed, 201 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d18fc8..e035679e646e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3665,6 +3665,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, specify
+			the list of CPUs on which processes will be able
+			to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..ed1bfc793c5a
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,50 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return tick_nohz_full_enabled() &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+	return task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+	return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+	if (task_isolation_enabled())
+		_task_isolation_enter();
+}
+
+#else
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a43edea..d439ee4f2ce2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1812,6 +1812,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2c0d20..fb0c707e527f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for tasks running hard real-time code in userspace,
+	 such as a 10 Gbit network driver.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..68a9f7457bc0
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,105 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
+	cpumask_copy(cpu_isolated_map, task_isolation_map);
+
+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
+	tick_nohz_full_running = true;
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+	    !task_isolation_possible(smp_processor_id()))
+		return -EINVAL;
+
+	current->task_isolation_flags = flags;
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  To handle
+ * the periodic scheduler tick, we test to make sure that the tick is
+ * stopped, and if it isn't yet, we request a reschedule so that if
+ * another task needs to run to completion first, it can do so.
+ * Similarly, if any other subsystems require quiescing, we will need
+ * to do that before we return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	/* If we need to drain the LRU cache, we're not ready. */
+	if (lru_add_drain_needed(smp_processor_id()))
+		return false;
+
+	/* If vmstats need updating, we're not ready. */
+	if (!vmstat_idle())
+		return false;
+
+	/* Request rescheduling unless we are in full dynticks mode. */
+	if (!tick_nohz_tick_stopped()) {
+		set_tsk_need_resched(current);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 6af9212ab5aa..7c97227dfb39 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2



* [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.
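
For illustration, a hypothetical userspace sketch (assuming the new
constants are visible via <sys/prctl.h>): request SIGUSR1 rather than
the default SIGKILL, and re-arm isolation from the handler, which
works because the kernel clears the flags before signalling:

	static void isolation_lost(int sig)
	{
		/* Our isolation flags were cleared before delivery;
		 * re-enable them if we intend to keep going. */
		prctl(PR_SET_TASK_ISOLATION,
		      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
		      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
	}

	/* In main(), after affinitizing to an isolated cpu: */
	signal(SIGUSR1, isolation_lost);
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));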

To allow the state to be entered and exited, we ignore the prctl()
syscall so that the flag can be cleared again later, and we ignore
exit/exit_group so that a task can exit without a pointless signal
killing it on the way out.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 25 +++++++++++++++++++
 include/uapi/linux/prctl.h |  3 +++
 kernel/isolation.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ed1bfc793c5a..69a3e4c59ab3 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -40,11 +40,36 @@ static inline void task_isolation_enter(void)
 		_task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+	return (task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags &
+		 (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+		(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT));
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+	return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...)			\
+	do {								\
+		if (task_isolation_strict())				\
+			task_isolation_exception(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
 #else
 static inline bool task_isolation_possible(int cpu) { return false; }
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 68a9f7457bc0..29ffb21ada0b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -103,3 +104,62 @@ void _task_isolation_enter(void)
 	/* Quieten the vmstat worker so it won't interrupt us. */
 	quiet_vmstat();
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+	siginfo_t info = {};
+	int sig;
+
+	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+		task->comm, task->pid, buf);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+
+	/*
+	 * Now that the requested signal has been read, turn off task
+	 * isolation mode entirely to avoid spamming the process with
+	 * signals.  It can re-enable it in the signal handler.
+	 */
+	task->task_isolation_flags = 0;
+	info.si_signo = sig;
+	send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return false;
+	}
+
+	task_isolation_exception("syscall %d", syscall);
+	return true;
+}
-- 
2.1.2



* [PATCH v9 06/13] task_isolation: add debug boot flag
@ 2016-01-04 19:34 ` Chris Metcalf
  2016-01-04 22:52   ` Steven Rostedt
  -1 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should receive no
interrupts from the kernel; when this boot flag is specified, any
interrupt that does arrive generates a kernel stack dump on the
console.

ftrace can already detect that a task_isolation core has unexpectedly
entered the kernel, but this boot flag allows the kernel to provide
better diagnostics, e.g. by reporting in the IPI-generating code
which remote core and context is preparing to deliver an interrupt
to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
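
For example (the cpu list here is purely illustrative), booting with

	task_isolation=1-3 task_isolation_debug

will cause any interrupt the kernel is about to deliver to cpus 1-3,
while they are running a task with PR_TASK_ISOLATION_ENABLE set, to
produce a rate-limited backtrace of the interrupting context on the
console.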

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  8 +++++
 include/linux/isolation.h           |  5 ++++
 kernel/irq_work.c                   |  5 +++-
 kernel/isolation.c                  | 60 +++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 | 18 +++++++++++
 kernel/signal.c                     |  5 ++++
 kernel/smp.c                        |  6 +++-
 kernel/softirq.c                    | 33 ++++++++++++++++++++
 8 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e035679e646e..112fba1727f4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3673,6 +3673,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 69a3e4c59ab3..3e15e75d078f 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -43,6 +43,9 @@ static inline void task_isolation_enter(void)
 extern bool task_isolation_syscall(int nr);
 extern void task_isolation_exception(const char *fmt, ...);
 extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 
 static inline bool task_isolation_strict(void)
 {
@@ -70,6 +73,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 static inline bool task_isolation_check_syscall(int nr) { return false; }
 static inline void task_isolation_check_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 29ffb21ada0b..9f31c0b458ed 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include "time/tick-sched.h"
 
@@ -163,3 +164,62 @@ bool task_isolation_syscall(int syscall)
 	task_isolation_exception("syscall %d", syscall);
 	return true;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static bool task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * If the task was in strict mode, deliver a signal to it.
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (p->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_interrupt(p, "interrupt");
+	}
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("Interrupt detected for task_isolation cpu %d, %s/%d\n",
+		       cpu, p->comm, p->pid);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask)
+{
+	int cpu, thiscpu = smp_processor_id();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu(cpu, mask)
+		if (cpu != thiscpu)
+			task_isolation_debug(cpu);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 732e993b564b..700120221f6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -746,6 +747,23 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+	struct task_struct *p;
+
+	if (!task_isolation_possible(cpu))
+		return;
+
+	rcu_read_lock();
+	p = cpu_curr(cpu);
+	get_task_struct(p);
+	rcu_read_unlock();
+	task_isolation_debug_task(cpu, p);
+	put_task_struct(p);
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a972fd..c45ef71f329c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,6 +638,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		t->task_isolation_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index d903c02223af..a61894409645 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,8 +179,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -457,6 +460,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..f249b71cddf4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id());
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.1.2



* [PATCH v9 07/13] arch/x86: enable task isolation functionality
@ 2016-01-04 19:34 ` Chris Metcalf
  2016-01-04 21:02   ` [PATCH v9bis " Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls; on a violation we set
regs->orig_ax to -1 so that the already-signalled task does not
actually execute the syscall.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/x86/entry/common.c | 10 +++++++++-
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..75958a6b5112 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
+		if (task_isolation_check_syscall(regs->orig_ax)) {
+			regs->orig_ax = -1;
+			return 0;
+		}
 		work &= ~_TIF_NOHZ;
 	}
 #endif
@@ -254,12 +259,15 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+		    task_isolation_ready())
 			break;
 
 	}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_check_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_check_exception */
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_check_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.1.2



* [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
arm64 do_notify_resume() is called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/entry.S  |  6 +++---
 arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ed3d75f6304..04eff4c4ac6e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -630,9 +630,8 @@ work_pending:
 	mov	x0, sp				// 'regs'
 	tst	x2, #PSR_MODE_MASK		// user mode regs?
 	b.ne	no_work_pending			// returning to kernel
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
+	bl	prepare_exit_to_usermode
+	b	no_user_work_pending
 work_resched:
 	bl	schedule
 
@@ -644,6 +643,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+no_user_work_pending:
 	enable_step_tsk x1, x2
 no_work_pending:
 	kernel_exit 0
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..fde59c1139a9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
+					 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	do {
+		local_irq_enable();
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+		if (thread_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+
+		local_irq_disable();
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		thread_flags = READ_ONCE(current_thread_info()->flags) &
+			_TIF_WORK_MASK;
 
+	} while (thread_flags);
 }
-- 
2.1.2



* [PATCH v9 09/13] arch/arm64: enable task isolation functionality
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

We need to call task_isolation_enter() from prepare_exit_to_usermode(),
so that we both ensure it runs last before returning to userspace,
and are able to re-run signal handling, etc., if something occurs
while task_isolation_enter() has interrupts enabled.  To do this we
add _TIF_NOHZ to the _TIF_WORK_MASK if CONFIG_TASK_ISOLATION is
enabled, which brings us into prepare_exit_to_usermode() on every
return to userspace.  But we don't put _TIF_NOHZ in the flags that
we use to loop back and recheck, since the flag being set is not by
itself a reason to loop back.  Instead we unconditionally call
task_isolation_enter() at the end of the loop whenever any other
work is done.

To make the assembly code continue to be as optimized as before,
we renumber the _TIF flags so that both _TIF_WORK_MASK and
_TIF_SYSCALL_WORK still have contiguous runs of bits in the
immediate operand for the "and" instruction, as required by the
ARM64 ISA.  Since TIF_NOHZ is in both masks, it must be the
middle bit in the contiguous run that starts with the
_TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits.

We tweak syscall_trace_enter() slightly to read the "flags" value
from current_thread_info()->flags once and reuse it for each of the
tests, rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_cross_call() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.
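
As an illustration of the resulting layout (a sketch based on the
renumbering below, with TIF_SIGPENDING at bit 0):

	_TIF_WORK_MASK     -> bits 0..4  (TIF_SIGPENDING .. TIF_NOHZ)
	_TIF_SYSCALL_WORK  -> bits 4..8  (TIF_NOHZ .. TIF_SECCOMP)

Each mask is thus a single contiguous run of set bits, sharing
TIF_NOHZ (bit 4) as the middle bit, so both can still be encoded
as logical immediates for the "and" instruction.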

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------
 arch/arm64/kernel/ptrace.c           | 12 +++++++++---
 arch/arm64/kernel/signal.c           |  7 +++++--
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  4 ++++
 5 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 90c7ff233735..94a98e9e29ef 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -103,11 +103,11 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
-#define TIF_NOHZ		7
-#define TIF_SYSCALL_TRACE	8
-#define TIF_SYSCALL_AUDIT	9
-#define TIF_SYSCALL_TRACEPOINT	10
-#define TIF_SECCOMP		11
+#define TIF_NOHZ		4
+#define TIF_SYSCALL_TRACE	5
+#define TIF_SYSCALL_AUDIT	6
+#define TIF_SYSCALL_TRACEPOINT	7
+#define TIF_SECCOMP		8
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -125,9 +125,15 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
-#define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
 
+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK		(_TIF_WORK_LOOP_MASK | _TIF_NOHZ)
+#else
+# define _TIF_WORK_MASK		_TIF_WORK_LOOP_MASK
+#endif
+
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
 				 _TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1971f491bb90..69ed3ba81650 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	/* Do the secure computing check first; failures should be fast. */
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno))
+		return -1;
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index fde59c1139a9..641c828653c7 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -419,10 +420,12 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
 		if (thread_flags & _TIF_FOREIGN_FPSTATE)
 			fpsimd_restore_current_state();
 
+		task_isolation_enter();
+
 		local_irq_disable();
 
 		thread_flags = READ_ONCE(current_thread_info()->flags) &
-			_TIF_WORK_MASK;
+			_TIF_WORK_LOOP_MASK;
 
-	} while (thread_flags);
+	} while (thread_flags || !task_isolation_ready());
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51b2c2e..dcb3282d04a2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -632,6 +633,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
 {
 	trace_ipi_raise(target, ipi_types[ipinr]);
+	task_isolation_debug_cpumask(target);
 	__smp_cross_call(target, ipinr);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1e8ca2..fbc78035b2af 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -466,6 +467,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
 	const struct fault_info *inf = fault_info + (esr & 63);
 	struct siginfo info;
 
+	if (user_mode(regs))
+		task_isolation_check_exception("%s at %#lx", inf->name, addr);
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
-- 
2.1.2



* [PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
tile do_work_pending() was called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

This change exposes a pre-existing bug on the older tilepro platform;
the singlestep processing is done last, but on tilepro (unlike tilegx)
we enable interrupts while doing that processing, so we could in
theory miss a signal or other asynchronous event.  A future change
could fix this by breaking the singlestep work into a "prepare"
step done in the main loop, and a "trigger" step done after exiting
the loop.  Since this change is intended as purely a restructuring
change, we call out the bug explicitly now, but don't yet fix it.
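
One possible shape for that future fix, purely as a sketch
(single_step_prepare() and single_step_trigger() are hypothetical
helpers, not existing tile functions):

	do {
		local_irq_enable();
		/* ... handle resched/signals/notify-resume as below ... */
		if (thread_info_flags & _TIF_SINGLESTEP)
			single_step_prepare(regs);	/* hypothetical */
		local_irq_disable();
		thread_info_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_info_flags & _TIF_WORK_MASK);

	if (thread_info_flags & _TIF_SINGLESTEP)
		single_step_trigger(regs);	/* hypothetical: runs with irqs disabled */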

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/include/asm/processor.h   |  2 +-
 arch/tile/include/asm/thread_info.h |  8 +++-
 arch/tile/kernel/intvec_32.S        | 46 +++++++--------------
 arch/tile/kernel/intvec_64.S        | 49 +++++++----------------
 arch/tile/kernel/process.c          | 79 +++++++++++++++++++------------------
 5 files changed, 77 insertions(+), 107 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index 139dfdee0134..0684e88aacd8 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task)
 	/* Nothing for now */
 }
 
-extern int do_work_pending(struct pt_regs *regs, u32 flags);
+extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags);
 
 
 /*
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index dc1fb28d9636..4b7cef9e94e0 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -140,10 +140,14 @@ extern void _cpu_idle(void);
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
 
+/* Work to do as we loop to exit to user space. */
+#define _TIF_WORK_MASK \
+	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
-	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ)
+	(_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ)
 
 /* Work to do at syscall entry. */
 #define _TIF_SYSCALL_ENTRY_WORK \
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index fbbe2ea882ea..33d48812872a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 * Get base of stack in r32.
-	 */
-	{
-	 GET_THREAD_INFO(r32)
-	 movei  r33, 0
-	}
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	GET_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, lo16(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, lo16(_TIF_ALLWORK_MASK)
 	}
 	{
-	 lw     r29, r29
-	 auli   r1, r1, ha16(_TIF_ALLWORK_MASK)
+	 lw     r22, r22
+	 auli   r20, r20, ha16(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	bzt     r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling, notify-resume, or single-step.  Call out to C
-	 * code to figure out exactly what we need to do for each flag bit,
-	 * then if necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnz    r33, 1f
+	 bzt    r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnz     r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 58964d209d4d..a41c994ce237 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 */
-	{
-	 movei  r33, 0
-	 move   r32, sp
-	}
-
-	/* Get base of stack in r32. */
-	EXTRACT_THREAD_INFO(r32)
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	move    r21, sp
+	EXTRACT_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, hw1_last(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, hw1_last(_TIF_ALLWORK_MASK)
 	}
 	{
-	 ld     r29, r29
-	 shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK)
+	 ld     r22, r22
+	 shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	beqzt   r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling or notify-resume.  Call out to C code to figure out
-	 * exactly what we need to do for each flag bit, then if
-	 * necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnez   r33, 1f
+	 beqzt  r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnez    r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 7d5769310bef..b5f30d376ce1 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
 
 /*
  * This routine is called on return from interrupt if any of the
- * TIF_WORK_MASK flags are set in thread_info->flags.  It is
- * entered with interrupts disabled so we don't miss an event
- * that modified the thread_info flags.  If any flag is set, we
- * handle it and return, and the calling assembly code will
- * re-disable interrupts, reload the thread flags, and call back
- * if more flags need to be handled.
- *
- * We return whether we need to check the thread_info flags again
- * or not.  Note that we don't clear TIF_SINGLESTEP here, so it's
- * important that it be tested last, and then claim that we don't
- * need to recheck the flags.
+ * TIF_ALLWORK_MASK flags are set in thread_info->flags.  It is
+ * entered with interrupts disabled so we don't miss an event that
+ * modified the thread_info flags.  We loop until all the tested flags
+ * are clear.  Note that the function is also called for certain flags
+ * that are not listed in the loop condition here (e.g. SINGLESTEP),
+ * which guarantees that work is done once after the loop, and is
+ * redone if any of the looping work items fires again, but never
+ * itself causes the loop to continue.
  */
-int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
+void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 {
-	/* If we enter in kernel mode, do nothing and exit the caller loop. */
-	if (!user_mode(regs))
-		return 0;
+	if (WARN_ON(!user_mode(regs)))
+		return;
 
-	user_exit();
+	do {
+		local_irq_enable();
 
-	/* Enable interrupts; they are disabled again on return to caller. */
-	local_irq_enable();
+		if (thread_info_flags & _TIF_NEED_RESCHED)
+			schedule();
 
-	if (thread_info_flags & _TIF_NEED_RESCHED) {
-		schedule();
-		return 1;
-	}
 #if CHIP_HAS_TILE_DMA()
-	if (thread_info_flags & _TIF_ASYNC_TLB) {
-		do_async_page_fault(regs);
-		return 1;
-	}
+		if (thread_info_flags & _TIF_ASYNC_TLB)
+			do_async_page_fault(regs);
 #endif
-	if (thread_info_flags & _TIF_SIGPENDING) {
-		do_signal(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_SINGLESTEP)
+
+		if (thread_info_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		local_irq_disable();
+		thread_info_flags = READ_ONCE(current_thread_info()->flags);
+
+	} while (thread_info_flags & _TIF_WORK_MASK);
+
+	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
+#ifndef __tilegx__
+		/*
+		 * FIXME: on tilepro, since we enable interrupts in
+		 * this routine, it's possible that we miss a signal
+		 * or other asynchronous event.
+		 */
+		local_irq_disable();
+#endif
+	}
 
 	user_enter();
-
-	return 0;
 }
 
 unsigned long get_wchan(struct task_struct *p)
-- 
2.1.2



* [PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This ensures that we always notify context tracking that we
have exited from user space no matter how we enter the kernel.
It is similar to how arm64 handles context tracking, for example.

This allows the removal of all the exception_enter() calls that
were added in commit 49e4e15619cd ("tile: support CONTEXT_TRACKING and
thus NOHZ_FULL").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/intvec_32.S   |  5 ++++-
 arch/tile/kernel/intvec_64.S   |  5 ++++-
 arch/tile/kernel/ptrace.c      | 15 ---------------
 arch/tile/kernel/single_step.c |  3 ---
 arch/tile/kernel/traps.c       | 13 ++++---------
 arch/tile/kernel/unaligned.c   | 13 ++++---------
 arch/tile/mm/fault.c           |  3 ---
 7 files changed, 16 insertions(+), 41 deletions(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 33d48812872a..9ff75e3a318a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -572,7 +572,7 @@ intvec_\vecname:
 	}
 	wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
 	.ifnc \function,handle_nmi
 	/*
 	 * We finally have enough state set up to notify the irq
@@ -588,6 +588,9 @@ intvec_\vecname:
 	{ move r32, r2; move r33, r3 }
 	.endif
 	TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+	jal     context_tracking_user_exit
+#endif
 	.ifnc \function,handle_syscall
 	{ move r0, r30; move r1, r31 }
 	{ move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index a41c994ce237..f080a6c3d82b 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -753,7 +753,7 @@ intvec_\vecname:
 	}
 	wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
 	.ifnc \function,handle_nmi
 	/*
 	 * We finally have enough state set up to notify the irq
@@ -769,6 +769,9 @@ intvec_\vecname:
 	{ move r32, r2; move r33, r3 }
 	.endif
 	TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+	jal     context_tracking_user_exit
+#endif
 	.ifnc \function,handle_syscall
 	{ move r0, r30; move r1, r31 }
 	{ move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index bdc126faf741..54e7b723db99 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -255,13 +255,6 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
-	/*
-	 * If TIF_NOHZ is set, we are required to call user_exit() before
-	 * doing anything that could touch RCU.
-	 */
-	if (work & _TIF_NOHZ)
-		user_exit();
-
 	if (secure_computing() == -1)
 		return -1;
 
@@ -281,12 +274,6 @@ void do_syscall_trace_exit(struct pt_regs *regs)
 	long errno;
 
 	/*
-	 * We may come here right after calling schedule_user()
-	 * in which case we can be in RCU user mode.
-	 */
-	user_exit();
-
-	/*
 	 * The standard tile calling convention returns the value (or negative
 	 * errno) in r0, and zero (or positive errno) in r1.
 	 * It saves a couple of cycles on the hot path to do this work in
@@ -322,7 +309,5 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs)
 /* Handle synthetic interrupt delivered only by the simulator. */
 void __kprobes do_breakpoint(struct pt_regs* regs, int fault_num)
 {
-	enum ctx_state prev_state = exception_enter();
 	send_sigtrap(current, regs);
-	exception_exit(prev_state);
 }
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 53f7b9def07b..862973074bf9 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,7 +23,6 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -739,7 +738,6 @@ static DEFINE_PER_CPU(unsigned long, ss_saved_pc);
 
 void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
 {
-	enum ctx_state prev_state = exception_enter();
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	struct thread_info *info = (void *)current_thread_info();
 	int is_single_step = test_ti_thread_flag(info, TIF_SINGLESTEP);
@@ -756,7 +754,6 @@ void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
 		__insn_mtspr(SPR_SINGLE_STEP_CONTROL_K, control);
 		send_sigtrap(current, regs);
 	}
-	exception_exit(prev_state);
 }
 
 
diff --git a/arch/tile/kernel/traps.c b/arch/tile/kernel/traps.c
index 0011a9ff0525..4d9651c5b1ad 100644
--- a/arch/tile/kernel/traps.c
+++ b/arch/tile/kernel/traps.c
@@ -20,7 +20,6 @@
 #include <linux/reboot.h>
 #include <linux/uaccess.h>
 #include <linux/ptrace.h>
-#include <linux/context_tracking.h>
 #include <asm/stack.h>
 #include <asm/traps.h>
 #include <asm/setup.h>
@@ -254,7 +253,6 @@ static int do_bpt(struct pt_regs *regs)
 void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 		       unsigned long reason)
 {
-	enum ctx_state prev_state = exception_enter();
 	siginfo_t info = { 0 };
 	int signo, code;
 	unsigned long address = 0;
@@ -263,7 +261,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 
 	/* Handle breakpoints, etc. */
 	if (is_kernel && fault_num == INT_ILL && do_bpt(regs))
-		goto done;
+		return;
 
 	/* Re-enable interrupts, if they were previously enabled. */
 	if (!(regs->flags & PT_FLAGS_DISABLE_IRQ))
@@ -277,7 +275,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 		const char *name;
 		char buf[100];
 		if (fixup_exception(regs))  /* ILL_TRANS or UNALIGN_DATA */
-			goto done;
+			return;
 		if (fault_num >= 0 &&
 		    fault_num < ARRAY_SIZE(int_name) &&
 		    int_name[fault_num] != NULL)
@@ -319,7 +317,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 	case INT_GPV:
 #if CHIP_HAS_TILE_DMA()
 		if (retry_gpv(reason))
-			goto done;
+			return;
 #endif
 		/*FALLTHROUGH*/
 	case INT_UDN_ACCESS:
@@ -346,7 +344,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 			if (!state ||
 			    (void __user *)(regs->pc) != state->buffer) {
 				single_step_once(regs);
-				goto done;
+				return;
 			}
 		}
 #endif
@@ -390,9 +388,6 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 	if (signo != SIGTRAP)
 		trace_unhandled_signal("trap", regs, address, signo);
 	force_sig_info(signo, &info, current);
-
-done:
-	exception_exit(prev_state);
 }
 
 void do_nmi(struct pt_regs *regs, int fault_num, unsigned long reason)
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index d075f92ccee0..0db5f7c9d9e5 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,7 +25,6 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1449,7 +1448,6 @@ void jit_bundle_gen(struct pt_regs *regs, tilegx_bundle_bits bundle,
 
 void do_unaligned(struct pt_regs *regs, int vecnum)
 {
-	enum ctx_state prev_state = exception_enter();
 	tilegx_bundle_bits __user  *pc;
 	tilegx_bundle_bits bundle;
 	struct thread_info *info = current_thread_info();
@@ -1503,7 +1501,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 				*((tilegx_bundle_bits *)(regs->pc)));
 			jit_bundle_gen(regs, bundle, align_ctl);
 		}
-		goto done;
+		return;
 	}
 
 	/*
@@ -1527,7 +1525,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 		trace_unhandled_signal("unaligned fixup trap", regs, 0, SIGBUS);
 		force_sig_info(info.si_signo, &info, current);
-		goto done;
+		return;
 	}
 
 
@@ -1544,7 +1542,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		trace_unhandled_signal("segfault in unalign fixup", regs,
 				       (unsigned long)info.si_addr, SIGSEGV);
 		force_sig_info(info.si_signo, &info, current);
-		goto done;
+		return;
 	}
 
 	if (!info->unalign_jit_base) {
@@ -1579,7 +1577,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 		if (IS_ERR((void __force *)user_page)) {
 			pr_err("Out of kernel pages trying do_mmap\n");
-			goto done;
+			return;
 		}
 
 		/* Save the address in the thread_info struct */
@@ -1592,9 +1590,6 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 	/* Generate unalign JIT */
 	jit_bundle_gen(regs, GX_INSN_BSWAP(bundle), align_ctl);
-
-done:
-	exception_exit(prev_state);
 }
 
 #endif /* __tilegx__ */
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..26734214818c 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,7 +35,6 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
-#include <linux/context_tracking.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -845,9 +844,7 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
-	enum ctx_state prev_state = exception_enter();
 	__do_page_fault(regs, fault_num, address, write);
-	exception_exit(prev_state);
 }
 
 #if CHIP_HAS_TILE_DMA()
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH v9 12/13] arch/tile: enable task isolation functionality
  2016-01-04 19:34 ` Chris Metcalf
                   ` (11 preceding siblings ...)
  (?)
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_check_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
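For context, a minimal consumer of this functionality looks roughly
like the sketch below.  It assumes the prctl() interface added earlier
in this series (PR_SET_TASK_ISOLATION / PR_TASK_ISOLATION_ENABLE; the
fallback values below are the ones this series uses, quoted from
memory and not a stable ABI), plus a core set aside with the
task_isolation= boot argument:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdlib.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_TASK_ISOLATION		/* from this series */
	# define PR_SET_TASK_ISOLATION		48
	# define PR_TASK_ISOLATION_ENABLE	1
	#endif

	int main(void)
	{
		cpu_set_t set;

		/* Pin ourselves to an isolated nohz_full core, e.g. cpu 1. */
		CPU_ZERO(&set);
		CPU_SET(1, &set);
		if (sched_setaffinity(0, sizeof(set), &set) != 0)
			exit(1);

		/* Ask the kernel to quiesce the core (pending timers,
		 * vmstat work, etc.) before each return to user space. */
		if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
			exit(1);

		for (;;) {
			/* userspace-only fast path; no syscalls here */
		}
	}
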
 arch/tile/kernel/process.c     |  6 +++++-
 arch/tile/kernel/ptrace.c      |  6 ++++++
 arch/tile/kernel/single_step.c |  5 +++++
 arch/tile/kernel/smp.c         | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c   |  3 +++
 arch/tile/mm/fault.c           |  3 +++
 arch/tile/mm/homecache.c       |  2 ++
 7 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..832febfd65df 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
-	} while (thread_info_flags & _TIF_WORK_MASK);
+	} while ((thread_info_flags & _TIF_WORK_MASK) ||
+		 !task_isolation_ready());
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..f76f2d8b8923 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -255,6 +256,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
+	if (work & _TIF_NOHZ) {
+		if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR]))
+			return -1;
+	}
+
 	if (secure_computing() == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..ba01eacde7a3 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,8 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +770,8 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..7298d68d4584 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(&task_isolation_map);
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu);
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 0db5f7c9d9e5..b1e229a1ff62 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	task_isolation_check_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..1dee18d3ffbd 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -844,6 +845,8 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
+	task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)",
+				       fault_num, regs->pc, address);
 	__do_page_fault(regs, fault_num, address, write);
 }
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..e044e8dd8372 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask);
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state
  2016-01-04 19:34 ` Chris Metcalf
                   ` (12 preceding siblings ...)
  (?)
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the scheduler tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
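For reference, the fallback described above lives in
kernel/time/clockevents.c and looks roughly like this in 4.4
(paraphrased from memory, trimmed to the relevant case):

	static int __clockevents_switch_state(struct clock_event_device *dev,
					      enum clock_event_state state)
	{
		switch (state) {
		/* ... */
		case CLOCK_EVT_STATE_ONESHOT_STOPPED:
			/*
			 * Without this patch, tile and the arm_arch_timer
			 * leave this callback NULL, so we return -ENOSYS
			 * here and never shut the hardware timer off.
			 */
			if (dev->set_state_oneshot_stopped)
				return dev->set_state_oneshot_stopped(dev);
			else
				return -ENOSYS;
		/* ... */
		}
	}
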
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index c64d543d64bf..727a669afb1f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -288,6 +288,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 19:34   ` Chris Metcalf
@ 2016-01-04 20:33     ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-04 20:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel

Hi,

On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> This change is a prerequisite change for TASK_ISOLATION but also
> stands on its own for readability and maintainability. 

I have also been looking into converting the userspace return path from
assembly to C [1], for the latter two reasons. Based on that, I have a
couple of comments.

> The existing arm64 do_notify_resume() is called in a loop from
> assembly on the slow path; this change moves the loop into C code as
> well.  For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add
> new, comprehensible entry and exit handlers written in C").
>
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/arm64/kernel/entry.S  |  6 +++---
>  arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 7ed3d75f6304..04eff4c4ac6e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -630,9 +630,8 @@ work_pending:
>  	mov	x0, sp				// 'regs'
>  	tst	x2, #PSR_MODE_MASK		// user mode regs?
>  	b.ne	no_work_pending			// returning to kernel
> -	enable_irq				// enable interrupts for do_notify_resume()
> -	bl	do_notify_resume
> -	b	ret_to_user
> +	bl	prepare_exit_to_usermode
> +	b	no_user_work_pending
>  work_resched:
>  	bl	schedule
>  
> @@ -644,6 +643,7 @@ ret_to_user:
>  	ldr	x1, [tsk, #TI_FLAGS]
>  	and	x2, x1, #_TIF_WORK_MASK
>  	cbnz	x2, work_pending
> +no_user_work_pending:
>  	enable_step_tsk x1, x2
>  no_work_pending:
>  	kernel_exit 0

It seems unfortunate to leave behind portions of the entry.S
_TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
and the majority of work_pending and ret_to_user).

I think it would be nicer if we could handle all of that in one place
(or at least all in C).

> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index e18c48cb6db1..fde59c1139a9 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> -asmlinkage void do_notify_resume(struct pt_regs *regs,
> -				 unsigned int thread_flags)
> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
> +					 unsigned int thread_flags)
>  {
> -	if (thread_flags & _TIF_SIGPENDING)
> -		do_signal(regs);
> +	do {
> +		local_irq_enable();
>  
> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
> -		clear_thread_flag(TIF_NOTIFY_RESUME);
> -		tracehook_notify_resume(regs);
> -	}
> +		if (thread_flags & _TIF_NEED_RESCHED)
> +			schedule();

Previously, had we called schedule(), we'd reload the thread info flags
and start that state machine again, whereas now we'll handle all the
cached flags before reloading.

Are we sure nothing is relying on the prior behaviour?

> +
> +		if (thread_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(regs);
> +		}
> +
> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
> +			fpsimd_restore_current_state();
> +
> +		local_irq_disable();
>  
> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
> -		fpsimd_restore_current_state();
> +		thread_flags = READ_ONCE(current_thread_info()->flags) &
> +			_TIF_WORK_MASK;
>  
> +	} while (thread_flags);
>  }

Other than that, this looks good to me.

Thanks,
Mark.

[1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 20:33     ` Mark Rutland
@ 2016-01-04 21:01       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 21:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel

On 01/04/2016 03:33 PM, Mark Rutland wrote:
> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.

Thanks!

> It seems unfortunate to leave behind portions of the entry.S
> _TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
> and the majority of work_pending and ret_to_user).
>
> I think it would be nicer if we could handle all of that in one place
> (or at least all in C).

Yes, in principle I agree with this, and I think your deasm tree looks
like an excellent idea.

For this patch series I wanted to focus more on what was necessary
for the various platforms to implement task isolation, and less on
additional cleanups of the platforms in question.  I think my changes
don't make the TIF state machine any less clear, nor do they make
it harder for an eventual further migration to C code along the lines
of what you've done, so it seems plausible to me to commit them
upstream independently of your work.

>> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
>> index e18c48cb6db1..fde59c1139a9 100644
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>>   	restore_saved_sigmask();
>>   }
>>   
>> -asmlinkage void do_notify_resume(struct pt_regs *regs,
>> -				 unsigned int thread_flags)
>> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
>> +					 unsigned int thread_flags)
>>   {
>> -	if (thread_flags & _TIF_SIGPENDING)
>> -		do_signal(regs);
>> +	do {
>> +		local_irq_enable();
>>   
>> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
>> -		clear_thread_flag(TIF_NOTIFY_RESUME);
>> -		tracehook_notify_resume(regs);
>> -	}
>> +		if (thread_flags & _TIF_NEED_RESCHED)
>> +			schedule();
> Previously, had we called schedule(), we'd reload the thread info flags
> and start that state machine again, whereas now we'll handle all the
> cached flags before reloading.
>
> Are we sure nothing is relying on the prior behaviour?

Good eye, and I probably should have called that out in the commit
message.  My best guess is that nothing depends on the old semantics.
Other platforms (certainly x86 and tile, anyway) already finish
handling the cached flags on return from schedule() before reloading,
so regardless, it's probably appropriate for arm64 to follow that same
convention.  Note that any work that becomes pending while we are in
schedule() is still handled before the return to userspace; it is just
picked up on the next loop iteration, when the flags are re-read.

>> +
>> +		if (thread_flags & _TIF_SIGPENDING)
>> +			do_signal(regs);
>> +
>> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
>> +			clear_thread_flag(TIF_NOTIFY_RESUME);
>> +			tracehook_notify_resume(regs);
>> +		}
>> +
>> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> +			fpsimd_restore_current_state();
>> +
>> +		local_irq_disable();
>>   
>> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> -		fpsimd_restore_current_state();
>> +		thread_flags = READ_ONCE(current_thread_info()->flags) &
>> +			_TIF_WORK_MASK;
>>   
>> +	} while (thread_flags);
>>   }
> Other than that, this looks good to me.
>
> Thanks,
> Mark.
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Thanks again for the review - shall I add your Reviewed-by (or Acked-by?)
to this patch?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH v9bis 07/13] arch/x86: enable task isolation functionality
  2016-01-04 19:34 ` [PATCH v9 07/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-01-04 21:02   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 21:02 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
Oops! In v9 I sent a version of this patch that didn't have the
semantic merge to 4.4 from Andy's commit 39b48e575e92 ("x86/entry:
Split and inline prepare_exit_to_usermode()").  This "v9bis" version
adds the necessary extra check to get into exit_to_usermode_loop()
in the first place when running in task-isolation mode.

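One other note, on the first hunk below: setting regs->orig_ax to -1
is the usual x86 idiom for suppressing a syscall (ptrace tracers skip
syscalls the same way).  The syscall number becomes invalid, so the
dispatch code never runs the handler; schematically (paraphrased, not
the literal entry code):

	/* in the syscall dispatch path, roughly: */
	if (likely((unsigned long)regs->orig_ax < NR_syscalls))
		regs->ax = sys_call_table[regs->orig_ax](regs->di, regs->si,
				regs->dx, regs->r10, regs->r8, regs->r9);
	/* orig_ax == -1 is out of range, so no handler runs */
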
 arch/x86/entry/common.c | 18 ++++++++++++++++--
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..477d8cafaaf2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
+		if (task_isolation_check_syscall(regs->orig_ax)) {
+			regs->orig_ax = -1;
+			return 0;
+		}
 		work &= ~_TIF_NOHZ;
 	}
 #endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+		    task_isolation_ready())
 			break;
 
 	}
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
@@ -278,7 +292,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	cached_flags =
 		READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+	if (unlikely(cached_flags & EXIT_TO_USERMODE_FLAGS))
 		exit_to_usermode_loop(regs, cached_flags);
 
 	user_enter();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_check_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_check_exception */
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_check_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 20:33     ` Mark Rutland
@ 2016-01-04 22:31       ` Andy Lutomirski
  0 siblings, 0 replies; 92+ messages in thread
From: Andy Lutomirski @ 2016-01-04 22:31 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel

On Mon, Jan 4, 2016 at 12:33 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
>
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.
>

>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Neat!

In case you want to compare notes, I have a branch with the entire
syscall path on x86 in C except for cleanly separated asm fast path
optimizations:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/entry_compat

Even in Linus' tree, the x86 32-bit syscalls are in C.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 19:34 ` [PATCH v9 06/13] task_isolation: add debug boot flag Chris Metcalf
@ 2016-01-04 22:52   ` Steven Rostedt
  2016-01-04 23:42     ` Chris Metcalf
  0 siblings, 1 reply; 92+ messages in thread
From: Steven Rostedt @ 2016-01-04 22:52 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Mon, 4 Jan 2016 14:34:44 -0500
Chris Metcalf <cmetcalf@ezchip.com> wrote:


> +#ifdef CONFIG_TASK_ISOLATION
> +void task_isolation_debug(int cpu)
> +{
> +	struct task_struct *p;
> +
> +	if (!task_isolation_possible(cpu))
> +		return;
> +
> +	rcu_read_lock();

What's the rcu_read_lock() for? I don't see what is being protected by
rcu here?

-- Steve

> +	p = cpu_curr(cpu);
> +	get_task_struct(p);
> +	rcu_read_unlock();
> +	task_isolation_debug_task(cpu, p);
> +	put_task_struct(p);
> +}
> +#endif
> +

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 22:52   ` Steven Rostedt
@ 2016-01-04 23:42     ` Chris Metcalf
  2016-01-05 13:42       ` Steven Rostedt
  0 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 23:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On 1/4/2016 5:52 PM, Steven Rostedt wrote:
> On Mon, 4 Jan 2016 14:34:44 -0500
> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>
>> >+#ifdef CONFIG_TASK_ISOLATION
>> >+void task_isolation_debug(int cpu)
>> >+{
>> >+	struct task_struct *p;
>> >+
>> >+	if (!task_isolation_possible(cpu))
>> >+		return;
>> >+
>> >+	rcu_read_lock();
> What's the rcu_read_lock() for? I don't see what is being protected by
> rcu here?

I'm not completely clear either, but this is the same idiom used throughout
kernel/sched/core.c when mapping from a pid or a cpu to a task_struct, since
obviously you could end up racing with the task_struct being removed after the
task dies.  My best understanding is that the rcu_read_lock() holds up the final
free of the structure so that we have time here to get another reference to it.

See for example sched_setaffinity() for a similar use of the idiom.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 23:42     ` Chris Metcalf
@ 2016-01-05 13:42       ` Steven Rostedt
  0 siblings, 0 replies; 92+ messages in thread
From: Steven Rostedt @ 2016-01-05 13:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Mon, 4 Jan 2016 18:42:00 -0500
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> On 1/4/2016 5:52 PM, Steven Rostedt wrote:
> > On Mon, 4 Jan 2016 14:34:44 -0500
> > Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >  
> >> >+#ifdef CONFIG_TASK_ISOLATION
> >> >+void task_isolation_debug(int cpu)
> >> >+{
> >> >+	struct task_struct *p;
> >> >+
> >> >+	if (!task_isolation_possible(cpu))
> >> >+		return;
> >> >+
> >> >+	rcu_read_lock();  
> > What's the rcu_read_lock() for? I don't see what is being protected by
> > rcu here?  
> 
> I'm not completely clear either, but this is the same idiom as is used throughout
> kernel/sched/core.c when mapping from a pid or a cpu to a task_struct, since
> obviously you could end up racing with the task_struct being removed after the
> task dies.  My best understanding is that the rcu_read_lock() holds up the final
> free of the structure so that we have time here to get another reference to it.
> 
> See for example sched_setaffinity() for a similar use of the idiom.
> 

Ah you're right. I'm still trying to get back up to speed from the
holidays. Yeah, we need to grab the lock to prevent the task from going
away from the time we get cpu_curr() to the time we up its ref count.
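
Schematically, the idiom works because the final free of a task_struct
is RCU-deferred (via delayed_put_task_struct()), so:

	rcu_read_lock();
	p = cpu_curr(cpu);	/* task may be exiting concurrently */
	get_task_struct(p);	/* safe: RCU holds up the final free, so
				 * we can still take a new reference */
	rcu_read_unlock();
	task_isolation_debug_task(cpu, p);
	put_task_struct(p);	/* drop our reference when done */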

-- Steve

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 21:01       ` Chris Metcalf
@ 2016-01-05 17:21         ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:21 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Rik van Riel, Catalin Marinas, Peter Zijlstra,
	Frederic Weisbecker, Will Deacon, linux-kernel, Steven Rostedt,
	Andy Lutomirski, Thomas Gleixner, linux-arm-kernel, Viresh Kumar,
	Tejun Heo, Andrew Morton, Paul E. McKenney, Christoph Lameter,
	Ingo Molnar, Gilad Ben Yossef

On Mon, Jan 04, 2016 at 04:01:05PM -0500, Chris Metcalf wrote:
> On 01/04/2016 03:33 PM, Mark Rutland wrote:
> >Hi,
> >
> >On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> >>This change is a prerequisite change for TASK_ISOLATION but also
> >>stands on its own for readability and maintainability.
> >I have also been looking into converting the userspace return path from
> >assembly to C [1], for the latter two reasons. Based on that, I have a
> >couple of comments.
> 
> Thanks!
> 
> >It seems unfortunate to leave behind portions of the entry.S
> >_TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
> >and the majority of work_pending and ret_to_user).
> >
> >I think it would be nicer if we could handle all of that in one place
> >(or at least all in C).
> 
> Yes, in principle I agree with this, and I think your deasm tree looks
> like an excellent idea.
> 
> For this patch series I wanted to focus more on what was necessary
> for the various platforms to implement task isolation, and less on
> additional cleanups of the platforms in question.  I think my changes
> don't make the TIF state machine any less clear, nor do they make
> it harder for an eventual further migration to C code along the lines
> of what you've done, so it seems plausible to me to commit them
> upstream independently of your work.

I appreciate that you don't want to rewrite all the code.

However, I think it's easier to factor out a small amount of additional
code now and evolve that as a whole than it will be to evolve part of it
and try to put it back together later.

I have a patch which I will reply with momentarily.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 1/2] arm64: entry: remove pointless SPSR mode check
  2016-01-05 17:21         ` Mark Rutland
@ 2016-01-05 17:33           ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:33 UTC (permalink / raw)
  To: cmetcalf
  Cc: mark.rutland, will.deacon, catalin.marinas, luto, linux-kernel,
	linux-arm-kernel

In work_pending, we may skip work if the stacked SPSR value represents
anything other than an EL0 context. We then immediately invoke the
kernel_exit 0 macro as part of ret_to_user, assuming a return to EL0.
This is somewhat confusing.

We use work_pending as part of the ret_to_user/ret_fast_syscall state
machine. We only use ret_fast_syscall in the return from an SVC issued
from EL0. We use ret_to_user for return from EL0 exception handlers and
also for return from ret_from_fork in the case the task was not a kernel
thread (i.e. it is a user task).

Thus in all cases the stacked SPSR value must represent an EL0 context,
and the check is redundant. This patch removes it, along with the now
unused no_work_pending label.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Will Deacon <will.deacon@arm.com>
---
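To summarize the callers (per the reasoning above):

	ret_fast_syscall -> work_pending	only from an EL0 SVC
	ret_to_user	 -> work_pending	EL0 exception returns, plus
						ret_from_fork for user tasks

so every path into work_pending has EL0 user regs stacked, and the SPSR
mode test can never take the no_work_pending branch.
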
 arch/arm64/kernel/entry.S | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ed3d75..6b30ab1 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -626,10 +626,7 @@ ret_fast_syscall_trace:
 work_pending:
 	tbnz	x1, #TIF_NEED_RESCHED, work_resched
 	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
-	ldr	x2, [sp, #S_PSTATE]
 	mov	x0, sp				// 'regs'
-	tst	x2, #PSR_MODE_MASK		// user mode regs?
-	b.ne	no_work_pending			// returning to kernel
 	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
 	b	ret_to_user
@@ -645,7 +642,6 @@ ret_to_user:
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
 	enable_step_tsk x1, x2
-no_work_pending:
 	kernel_exit 0
 ENDPROC(ret_to_user)
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:21         ` Mark Rutland
@ 2016-01-05 17:33           ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:33 UTC (permalink / raw)
  To: cmetcalf
  Cc: mark.rutland, will.deacon, catalin.marinas, luto, linux-kernel,
	linux-arm-kernel

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting it to C in the process and
making it more legible.

This patch tries to mirror the existing behaviour as closely as possible
while using the usual C control flow primitives. There should be no
functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/kernel/entry.S  | 24 +++---------------------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6b30ab1..41f5dfc 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -612,35 +612,17 @@ ret_fast_syscall:
 	ldr	x1, [tsk, #TI_FLAGS]		// re-check for syscall tracing
 	and	x2, x1, #_TIF_SYSCALL_WORK
 	cbnz	x2, ret_fast_syscall_trace
-	and	x2, x1, #_TIF_WORK_MASK
-	cbnz	x2, work_pending
-	enable_step_tsk x1, x2
-	kernel_exit 0
+	b	ret_to_user
 ret_fast_syscall_trace:
 	enable_irq				// enable interrupts
 	b	__sys_trace_return_skipped	// we already saved x0
 
 /*
- * Ok, we need to do extra processing, enter the slow path.
- */
-work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
-	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
-	bl	schedule
-
-/*
  * "slow" syscall return path.
  */
 ret_to_user:
-	disable_irq				// disable interrupts
-	ldr	x1, [tsk, #TI_FLAGS]
-	and	x2, x1, #_TIF_WORK_MASK
-	cbnz	x2, work_pending
+	bl	do_notify_resume
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48c..3a6c60b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,34 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void do_notify_resume(void)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long thread_flags;
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+	for (;;) {
+		local_irq_disable();
+
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+		if (!(thread_flags & _TIF_WORK_MASK))
+			break;
+
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+			continue;
+		}
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		local_irq_enable();
 
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+	}
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 22:31       ` Andy Lutomirski
@ 2016-01-05 18:01         ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 18:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel

On Mon, Jan 04, 2016 at 02:31:42PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 4, 2016 at 12:33 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > Hi,
> >
> > On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> >> This change is a prerequisite change for TASK_ISOLATION but also
> >> stands on its own for readability and maintainability.
> >
> > I have also been looking into converting the userspace return path from
> > assembly to C [1], for the latter two reasons. Based on that, I have a
> > couple of comments.
> >
> 
> >
> > [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm
> 
> Neat!
> 
> In case you want to compare notes, I have a branch with the entire
> syscall path on x86 in C except for cleanly separated asm fast path
> optimizations:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/entry_compat

It was in fact your x86 effort that inspired me to look at this!

Thanks for the pointer, I'm almost certainly going to steal an idea or
two.

Currently it looks like arm64's conversion will be less painful than
that for x86 as the entry assembly is smaller and relatively uniform.
Everything but the register save/restore looks doable in C.

That said, I have yet to stress/validate everything with tracing, irq
debugging, and so on, so my confidence may be misplaced.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-05 18:53             ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-05 18:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: will.deacon, catalin.marinas, luto, linux-kernel, linux-arm-kernel

On 01/05/2016 12:33 PM, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
>
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
>
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland<mark.rutland@arm.com>
> Cc: Catalin Marinas<catalin.marinas@arm.com>
> Cc: Chris Metcalf<cmetcalf@ezchip.com>
> Cc: Will Deacon<will.deacon@arm.com>
> ---
>   arch/arm64/kernel/entry.S  | 24 +++---------------------
>   arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>   2 files changed, 29 insertions(+), 31 deletions(-)

This looks good, and also makes the task isolation change drop in
very cleanly (relatively speaking).  Since do_notify_resume() is
called unconditionally now, we don't have to worry about fussing
with the bit numbering for the TIF_xxx flags in asm/thread_info.h, so
that whole part of the patch can be dropped, and the actual
change to do_notify_resume() becomes:

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 3a6c60beadca..00d0ec3a8e60 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
  #include <linux/uaccess.h>
  #include <linux/tracehook.h>
  #include <linux/ratelimit.h>
+#include <linux/isolation.h>
  
  #include <asm/debug-monitors.h>
  #include <asm/elf.h>
@@ -408,7 +409,8 @@ asmlinkage void do_notify_resume(void)
                 local_irq_disable();
  
                 thread_flags = READ_ONCE(current_thread_info()->flags);
-               if (!(thread_flags & _TIF_WORK_MASK))
+               if (!(thread_flags & _TIF_WORK_MASK) &&
+                   task_isolation_ready())
                         break;
  
                 if (thread_flags & _TIF_NEED_RESCHED) {
@@ -428,5 +430,7 @@ asmlinkage void do_notify_resume(void)
  
                 if (thread_flags & _TIF_FOREIGN_FPSTATE)
                         fpsimd_restore_current_state();
+
+               task_isolation_enter();
         }
  }


For the moment I just added your two commits into my task-isolation
tree and pushed it up, but if your changes make it into 4.5 and the
task-isolation series doesn't, I will remove them and rebase on 4.5-rc1
once that's released.  I've similarly staged the arch/tile enablement
changes to go into 4.5 so I can drop them from the task-isolation tree
as well at that point.
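
For reference, the two hooks used in the diff above behave roughly as
follows.  This is a paraphrased sketch built from the helpers the series
adds (vmstat_idle(), lru_add_drain_needed(), quiet_vmstat()), not the
exact code; in particular the lru_add_drain_needed() signature here is a
guess:

/* sketch only - see kernel/isolation.c in the series for the real code */
static inline bool task_isolation_ready(void)
{
	/* ready once the core is quiesced: vmstat deltas flushed, LRU
	 * pagevecs drained, and the dynticks tick already stopped */
	return vmstat_idle() &&
	       !lru_add_drain_needed(smp_processor_id()) &&
	       tick_nohz_tick_stopped();
}

static inline void task_isolation_enter(void)
{
	/* do whatever work is needed to make the checks above pass */
	lru_add_drain();
	quiet_vmstat();
}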

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply related	[flat|nested] 92+ messages in thread


* Re: [PATCH 1/2] arm64: entry: remove pointless SPSR mode check
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 12:15             ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 12:15 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:34PM +0000, Mark Rutland wrote:
> In work_pending we may skip work if the stacked SPSR value represents
> anything other than an EL0 context. We then immediately invoke the
> kernel_exit 0 macro as part of ret_to_user, assuming a return to EL0.
> This is somewhat confusing.
> 
> We use work_pending as part of the ret_to_user/ret_fast_syscall state
> machine. We only use ret_fast_syscall in the return from an SVC issued
> from EL0. We use ret_to_user for return from EL0 exception handlers and
> also for return from ret_from_fork in the case the task was not a kernel
> thread (i.e. it is a user task).
> 
> Thus in all cases the stacked SPSR value must represent an EL0 context,
> and the check is redundant. This patch removes it, along with the now
> unused no_work_pending label.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 12:30             ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 12:30 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
> 
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
> 
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>

This is definitely cleaner. The only downside is slightly more expensive
ret_fast_syscall. I guess it's not noticeable (though we could do some
quick benchmark like getpid in a loop). Anyway, I'm fine with the patch:

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
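
For reference, the quick benchmark mentioned above can be as simple as
timing the raw syscall in a loop.  A minimal userspace sketch (not part
of the patch; syscall(SYS_getpid) is used to bypass glibc's cached
getpid()):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	const long iters = 10000000;
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* bypass the glibc pid cache */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per getpid() round trip\n", ns / iters);
	return 0;
}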

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-06 12:30             ` Catalin Marinas
@ 2016-01-06 12:47               ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-06 12:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Wed, Jan 06, 2016 at 12:30:11PM +0000, Catalin Marinas wrote:
> On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> > Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> > state machine that can be difficult to reason about due to duplicated
> > code and a large number of branch targets.
> > 
> > This patch factors the common logic out into the existing
> > do_notify_resume function, converting the code to C in the process,
> > making the code more legible.
> > 
> > This patch tries to mirror the existing behaviour as closely as possible
> > while using the usual C control flow primitives. There should be no
> > functional change as a result of this patch.
> > 
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Chris Metcalf <cmetcalf@ezchip.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> 
> This is definitely cleaner. The only downside is slightly more expensive
> ret_fast_syscall. I guess it's not noticeable (though we could do some
> quick benchmark like getpid in a loop). Anyway, I'm fine with the patch:
> 
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>

Cheers!

While any additional overhead hasn't been noticeable, I'll try to get
some numbers out as part of the larger deasm testing/benchmarking.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 13:43             ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-06 13:43 UTC (permalink / raw)
  To: cmetcalf, catalin.marinas
  Cc: will.deacon, luto, linux-kernel, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
> 
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
> 
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.

I realised there is a problem with this for kernels built with
TRACE_IRQFLAGS, as local_irq_{enable,disable}() will verify that the IRQ
state is as expected.

In ret_fast_syscall we disable irqs behind the back of the tracer, so
when we get into do_notify_resume we'll get a splat.

In the non-syscall cases we do not disable interrupts first, so we can't
balance things in do_notify_resume.

We can either add a trace_hardirqs_off call to ret_fast_syscall, or we
can use raw_local_irq_{disable,enable}. The latter would match the
current behaviour (and is a nicer diff). Once the syscall path is moved
to C it would be possible to use the non-raw variants all-over.

Catalin, are you happy with using the raw accessors in do_notify_resume,
or would you prefer using trace_hardirqs_off?
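
Concretely, the raw-accessor option would make the loop look like the
sketch below (work handling elided; it stays exactly as in the patch
quoted at the end of this mail):

asmlinkage void do_notify_resume(void)
{
	unsigned long thread_flags;

	for (;;) {
		/* raw variants skip the TRACE_IRQFLAGS bookkeeping, so
		 * there is no splat when the entry assembly disabled
		 * irqs without informing the tracer */
		raw_local_irq_disable();

		thread_flags = READ_ONCE(current_thread_info()->flags);
		if (!(thread_flags & _TIF_WORK_MASK))
			break;

		if (thread_flags & _TIF_NEED_RESCHED) {
			schedule();
			continue;
		}

		raw_local_irq_enable();

		/* handle _TIF_SIGPENDING, _TIF_NOTIFY_RESUME and
		 * _TIF_FOREIGN_FPSTATE as before */
	}
}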

Thanks,
Mark.

> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/kernel/entry.S  | 24 +++---------------------
>  arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>  2 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 6b30ab1..41f5dfc 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -612,35 +612,17 @@ ret_fast_syscall:
>  	ldr	x1, [tsk, #TI_FLAGS]		// re-check for syscall tracing
>  	and	x2, x1, #_TIF_SYSCALL_WORK
>  	cbnz	x2, ret_fast_syscall_trace
> -	and	x2, x1, #_TIF_WORK_MASK
> -	cbnz	x2, work_pending
> -	enable_step_tsk x1, x2
> -	kernel_exit 0
> +	b	ret_to_user
>  ret_fast_syscall_trace:
>  	enable_irq				// enable interrupts
>  	b	__sys_trace_return_skipped	// we already saved x0
>  
>  /*
> - * Ok, we need to do extra processing, enter the slow path.
> - */
> -work_pending:
> -	tbnz	x1, #TIF_NEED_RESCHED, work_resched
> -	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
> -	mov	x0, sp				// 'regs'
> -	enable_irq				// enable interrupts for do_notify_resume()
> -	bl	do_notify_resume
> -	b	ret_to_user
> -work_resched:
> -	bl	schedule
> -
> -/*
>   * "slow" syscall return path.
>   */
>  ret_to_user:
> -	disable_irq				// disable interrupts
> -	ldr	x1, [tsk, #TI_FLAGS]
> -	and	x2, x1, #_TIF_WORK_MASK
> -	cbnz	x2, work_pending
> +	bl	do_notify_resume
> +	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
>  	enable_step_tsk x1, x2
>  	kernel_exit 0
>  ENDPROC(ret_to_user)
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index e18c48c..3a6c60b 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -399,18 +399,34 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> -asmlinkage void do_notify_resume(struct pt_regs *regs,
> -				 unsigned int thread_flags)
> +asmlinkage void do_notify_resume(void)
>  {
> -	if (thread_flags & _TIF_SIGPENDING)
> -		do_signal(regs);
> +	struct pt_regs *regs = task_pt_regs(current);
> +	unsigned long thread_flags;
>  
> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
> -		clear_thread_flag(TIF_NOTIFY_RESUME);
> -		tracehook_notify_resume(regs);
> -	}
> +	for (;;) {
> +		local_irq_disable();

This should be raw_local_irq_disable()...

> +
> +		thread_flags = READ_ONCE(current_thread_info()->flags);
> +		if (!(thread_flags & _TIF_WORK_MASK))
> +			break;
> +
> +		if (thread_flags & _TIF_NEED_RESCHED) {
> +			schedule();
> +			continue;
> +		}
>  
> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
> -		fpsimd_restore_current_state();
> +		local_irq_enable();

... likewise, raw_local_irq_enable() here.

>  
> +		if (thread_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(regs);
> +		}
> +
> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
> +			fpsimd_restore_current_state();
> +	}
>  }
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-06 13:43             ` Mark Rutland
@ 2016-01-06 14:17               ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 14:17 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, linux-arm-kernel, luto

On Wed, Jan 06, 2016 at 01:43:14PM +0000, Mark Rutland wrote:
> On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> > Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> > state machine that can be difficult to reason about due to duplicated
> > code and a large number of branch targets.
> > 
> > This patch factors the common logic out into the existing
> > do_notify_resume function, converting the code to C in the process,
> > making the code more legible.
> > 
> > This patch tries to mirror the existing behaviour as closely as possible
> > while using the usual C control flow primitives. There should be no
> > functional change as a result of this patch.
> 
> I realised there is a problem with this for kernel built with
> TRACE_IRQFLAGS, as local_irq_{enable,disable}() will verify that the IRQ
> state is as expected.
> 
> In ret_fast_syscall we disable irqs behind the back of the tracer, so
> when we get into do_notify_resume we'll get a splat.
> 
> In the non-syscall cases we do not disable interrupts first, so we can't
> balance things in do_notify_resume.
> 
> We can either add a trace_hardirqs_off call to ret_fast_syscall, or we
> can use raw_local_irq_{disable,enable}. The latter would match the
> current behaviour (and is a nicer diff). Once the syscall path is moved
> to C it would be possible to use the non-raw variants all-over.
> 
> Catalin, are you happy with using the raw accessors in do_notify_resume,
> or would you prefer using trace_hardirqs_off?

I would prefer the explicit trace_hardirqs_off annotation, even though
it is a few more lines.

-- 
Catalin

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-11 21:15   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-11 21:15 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

Ping!  There has been no substantive feedback to this version of
the patch in the week since I posted it, which optimistically suggests
to me that people may be satisfied with it.  If that's true, Frederic,
I assume this would be pulled into your tree?

I have slightly updated the v9 patch series since this posting:

- Incorporated a fix to initialize cpu_isolation_mask early if no
   cpu_isolation= boot argument was given, to avoid crashing on
   CPUMASK_OFFSTACK platforms.

- Incorporated Mark Rutland's changes to convert arm64
   assembly to C code instead of using my own version.

The updated patch series is available in the branch at

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

I will post a v10 with those couple of small changes if I don't hear
any other feedback, or of course feel free to pull from the git repo.
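
For anyone experimenting with the branch: after pinning a thread to one
of the task_isolation= cores, isolation is requested via prctl().  A
minimal sketch, assuming the constant names and values from the series'
uapi prctl.h additions (check the branch for the real definitions):

#include <stdio.h>
#include <sys/prctl.h>

/* values assumed for illustration; see the series' prctl.h changes */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#endif

int main(void)
{
	/* assumes this thread is already affinitized to an isolated core */
	if (prctl(PR_SET_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT,
		  0, 0, 0) < 0) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}

	for (;;)
		;	/* pure userspace loop; any kernel entry now
			 * raises the STRICT signal */
}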

On 01/04/2016 02:34 PM, Chris Metcalf wrote:
> It has been a couple of months since the v8 version of this patch,
> since various other priorities came up at work.  Since it's been
> a while I will try to summarize where I think we got to on the
> various issues that were raised with v8.
>
> 1. Andy Lutomirski raised the issue of whether it really made sense to
>     only attempt to set up the conditions for task isolation, ask the kernel
>     nicely for it, and then wait until it happened.  He wondered if a
>     SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
>     also suggested having an interface that would force everything else
>     off a core to enable SCHED_ISOLATED to succeed.  Frederick added
>     some concerns about enforcing the test that the process was in a
>     good state to enter task isolation.
>
>     I tried to address the different design philosophies for what I called
>     the original "polite" mode and the reviewers' suggestions for an
>     "aggressive" mode in this email:
>
>     https://lkml.org/lkml/2015/10/26/625
>
>     As I said there, on balance I think the "polite" option is still
>     better.  Obviously folks are welcome to disagree and I'm happy to
>     continue that conversation (or perhaps I convinced everyone).
>
> 2. Andy didn't like the idea of having a "STRICT" mode which
>     delivered a signal to a process for violating the contract that it
>     will promise to stay out of the kernel.  Gilad Ben Yossef argued that
>     it made sense to have a way for the kernel to enforce the requested
>     correctness guarantee of never being interrupted.  Andy pointed out
>     that we should then really deliver such a signal when the kernel
>     delivers an asynchronous interrupt to the core as well.  In particular
>     this is a concern for the application-error case of a process that
>     calls munmap() on one core while a thread on another core is running
>     STRICT, and thus gets an unexpected TLB flush.
>
>     This patch series addresses that concern by including support for
>     IRQs, IPIs, and similar asynchronous interrupts to also send the
>     STRICT signal to the process.  We don't try to send the signal if
>     we are in an NMI, and instead just force a console backtrace like
>     you would get in task_isolation_debug mode.
>
> 3. Frederic nack'ed my patch for a boot flag to disable the 1Hz
>     periodic scheduler tick.
>
>     I'm still hoping he's open to changing his mind about that, but in
>     this patch series I have removed that boot flag.
>
> Various other changes have been introduced since v8:
>
> https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com
>
> - Rebased to Linux 4.4-rc5.
>
> - Since nohz_full and isolcpus have been separated back out again in
>    4.4, I introduced a new task_isolation=MASK boot argument that sets
>    both of them.  The task isolation support now requires that this
>    boot flag have been used; it intentionally doesn't work if you've
>    just enabled nohz_full and isolcpus separately.  I could be
>    convinced that doing it the other way around makes sense, though.
>
> - I folded the two STRICT mode patches together since there didn't
>    seem to be much value in having the second patch that just enabled
>    having a settable signal.  I also refactored the various routines
>    that report on interrupts/exceptions/etc to make it easier to hook
>    in from the case where we are interrupted asynchronously.
>
> - For the debug support, I moved most of the functionality into
>    kernel/isolation.c and out of kernel/sched/core.c, leaving only a
>    small hook to handle mapping a remote cpu to a task struct safely.
>    In addition to implementing Andy's suggestion of signalling a task
>    when it is interrupted asynchronously, I also added a ratelimit
>    hook so we won't spam the console if (for example) a timer interrupt
>    runs amok - particularly since when this happens without ratelimit,
>    it can end up self-perpetuating the timer interrupt.
>
> - I added a task_isolation_debug_cpumask() helper function to check
>    all the cpus in a mask to see if they are being interrupted
>    inappropriately.
>
> - I made the check for irq_enter() robust to architectures that
>    have already entered user mode context_tracking before calling
>    irq_enter() by testing user_mode(get_irq_regs()) instead of
>    context_tracking_in_user(), and split out the code to a separate
>    inlined function so I could comment it better.
>
> - For arm64, I added a task_isolation_debug_cpumask() hook for
>    smp_cross_call(), which I had missed in the earlier versions.
>
> - I generalized the fix for tile to set up a clockevents hook for
>    set_state_oneshot_stopped() to also apply to the arm_arch_timer,
>    which I realized was showing the same problem.  For both cases,
>    this seems to be what Viresh had in mind with commit 8fff52fd509345
>    ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").
>
> - For tile, I adopted the arm model of doing user_exit() calls in the
>    early assembly code (a new patch in this series).  I also added a
>    missing task_isolation_debug hook for tile's IPI and remote cache
>    flush code.
>
> Chris Metcalf (12):
>    vmstat: add vmstat_idle function
>    lru_add_drain_all: factor out lru_add_drain_needed
>    task_isolation: add initial support
>    task_isolation: support PR_TASK_ISOLATION_STRICT mode
>    task_isolation: add debug boot flag
>    arch/x86: enable task isolation functionality
>    arch/arm64: adopt prepare_exit_to_usermode() model from x86
>    arch/arm64: enable task isolation functionality
>    arch/tile: adopt prepare_exit_to_usermode() model from x86
>    arch/tile: move user_exit() to early kernel entry sequence
>    arch/tile: enable task isolation functionality
>    arm, tile: turn off timer tick for oneshot_stopped state
>
> Christoph Lameter (1):
>    vmstat: provide a function to quiet down the diff processing
>
>   Documentation/kernel-parameters.txt  |  16 +++
>   arch/arm64/include/asm/thread_info.h |  18 ++-
>   arch/arm64/kernel/entry.S            |   6 +-
>   arch/arm64/kernel/ptrace.c           |  12 +-
>   arch/arm64/kernel/signal.c           |  35 ++++--
>   arch/arm64/kernel/smp.c              |   2 +
>   arch/arm64/mm/fault.c                |   4 +
>   arch/tile/include/asm/processor.h    |   2 +-
>   arch/tile/include/asm/thread_info.h  |   8 +-
>   arch/tile/kernel/intvec_32.S         |  51 +++-----
>   arch/tile/kernel/intvec_64.S         |  54 +++------
>   arch/tile/kernel/process.c           |  83 +++++++------
>   arch/tile/kernel/ptrace.c            |  19 +--
>   arch/tile/kernel/single_step.c       |   8 +-
>   arch/tile/kernel/smp.c               |  26 ++--
>   arch/tile/kernel/time.c              |   1 +
>   arch/tile/kernel/traps.c             |  13 +-
>   arch/tile/kernel/unaligned.c         |  16 ++-
>   arch/tile/mm/fault.c                 |   6 +-
>   arch/tile/mm/homecache.c             |   2 +
>   arch/x86/entry/common.c              |  10 +-
>   arch/x86/kernel/traps.c              |   2 +
>   arch/x86/mm/fault.c                  |   2 +
>   drivers/clocksource/arm_arch_timer.c |   2 +
>   include/linux/isolation.h            |  80 +++++++++++++
>   include/linux/sched.h                |   3 +
>   include/linux/swap.h                 |   1 +
>   include/linux/vmstat.h               |   4 +
>   include/uapi/linux/prctl.h           |   8 ++
>   init/Kconfig                         |  20 ++++
>   kernel/Makefile                      |   1 +
>   kernel/irq_work.c                    |   5 +-
>   kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
>   kernel/sched/core.c                  |  18 +++
>   kernel/signal.c                      |   5 +
>   kernel/smp.c                         |   6 +-
>   kernel/softirq.c                     |  33 +++++
>   kernel/sys.c                         |   9 ++
>   mm/swap.c                            |  13 +-
>   mm/vmstat.c                          |  24 ++++
>   40 files changed, 665 insertions(+), 188 deletions(-)
>   create mode 100644 include/linux/isolation.h
>   create mode 100644 kernel/isolation.c
>

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-11 21:15   ` Chris Metcalf
  (?)
@ 2016-01-12 10:07   ` Will Deacon
  2016-01-12 17:49       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Will Deacon @ 2016-01-12 10:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?
> 
> I have slightly updated the v9 patch series since this posting:
> 
> - Incorporated a fix to initialize cpu_isolation_mask early if no
>   cpu_isolation= boot argument was given, to avoid crashing on
>   CPUMASK_OFFSTACK platforms.
> 
> - Incorporated Mark Rutland's changes to convert arm64
>   assembly to C code instead of using my own version.

Please avoid queuing these patches -- the first is already in the arm64
queue for 4.5 and the second was found to introduce a substantial
performance regression on the syscall entry/exit path. I think Mark had
an updated version to address that, so it would be easier not to have
an old version sitting in some other queue!

Cheers,

Will

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-12 10:53     ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2016-01-12 10:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel


* Chris Metcalf <cmetcalf@ezchip.com> wrote:

> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?

We are right before (and into) the merge window; don't expect substantial feedback
in that timeframe, as most kernel maintainers are very busy.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-12 17:49       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-12 17:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland

(Adding Mark to cc's)

On 01/12/2016 05:07 AM, Will Deacon wrote:
> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>> Ping!  There has been no substantive feedback to this version of
>> the patch in the week since I posted it, which optimistically suggests
>> to me that people may be satisfied with it.  If that's true, Frederic,
>> I assume this would be pulled into your tree?
>>
>> I have slightly updated the v9 patch series since this posting:
>>
>> [...]
>>
>> - Incorporated Mark Rutland's changes to convert arm64
>>    assembly to C code instead of using my own version.
> Please avoid queuing these patches -- the first is already in the arm64
> queue for 4.5 and the second was found to introduce a substantial
> performance regression on the syscall entry/exit path. I think Mark had
> an updated version to address that, so it would be easier not to have
> an old version sitting in some other queue!

I am not formally queueing them anywhere (like linux-next), though
now that you mention it, that's a pretty good idea - I'll talk to Steven
about that, assuming this merge window closes without the task
isolation stuff going in.

In the arch/tile code, we load the thread_info_flags and test them
against a bitmask before we call into C code, to avoid the various
overheads involved in the C path.  Perhaps that same strategy is all
that's needed for the arm64 code?  Hopefully you can get that
code merged up during the 4.5 window so I can use it as the new
baseline for the task isolation stuff.
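
Expressed as C rather than assembly, that fast path is roughly the
following sketch (the mask name here is arm64's; tile tests its own
per-arch work bits, and does so before any of the C-path save/restore
cost is paid):

        /* Sketch: what the tile assembly does, in C.  Only fall into
         * the heavyweight C exit work when a work bit is actually set.
         */
        static void exit_to_usermode_fast(struct pt_regs *regs)
        {
                unsigned long flags = READ_ONCE(current_thread_info()->flags);

                if (unlikely(flags & _TIF_WORK_MASK))
                        prepare_exit_to_usermode(regs);
                /* else: restore registers, return straight to user */
        }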

Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-13 10:44         ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2016-01-13 10:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland


* Chris Metcalf <cmetcalf@ezchip.com> wrote:

> (Adding Mark to cc's)
> 
> On 01/12/2016 05:07 AM, Will Deacon wrote:
> >On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>Ping!  There has been no substantive feedback to this version of
> >>the patch in the week since I posted it, which optimistically suggests
> >>to me that people may be satisfied with it.  If that's true, Frederic,
> >>I assume this would be pulled into your tree?
> >>
> >>I have slightly updated the v9 patch series since this posting:
> >>
> >>[...]
> >>
> >>- Incorporated Mark Rutland's changes to convert arm64
> >>   assembly to C code instead of using my own version.
> >Please avoid queuing these patches -- the first is already in the arm64
> >queue for 4.5 and the second was found to introduce a substantial
> >performance regression on the syscall entry/exit path. I think Mark had
> >an updated version to address that, so it would be easier not to have
> >an old version sitting in some other queue!
> 
> I am not formally queueing them anywhere (like linux-next), though
> now that you mention it, that's a pretty good idea - I'll talk to Steven
> about that, assuming this merge window closes without the task
> isolation stuff going in.

NAK. Given the controversy, no way should this stuff go outside the primary trees 
it affects: the scheduler, timer, irq, etc. trees.

We can merge this up in -tip once everyone is happy... but as I said, don't expect 
many replies before and during the merge window.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 10:44         ` Ingo Molnar
@ 2016-01-13 21:19           ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-13 21:19 UTC (permalink / raw)
  To: Ingo Molnar, Mark Rutland
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> * Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> (Adding Mark to cc's)
>>
>> On 01/12/2016 05:07 AM, Will Deacon wrote:
>>> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>>>> Ping!  There has been no substantive feedback to this version of
>>>> the patch in the week since I posted it, which optimistically suggests
>>>> to me that people may be satisfied with it.  If that's true, Frederic,
>>>> I assume this would be pulled into your tree?
>>>>
>>>> I have slightly updated the v9 patch series since this posting:
>>>>
>>>> [...]
>>>>
>>>> - Incorporated Mark Rutland's changes to convert arm64
>>>>    assembly to C code instead of using my own version.
>>> Please avoid queuing these patches -- the first is already in the arm64
>>> queue for 4.5 and the second was found to introduce a substantial
>>> performance regression on the syscall entry/exit path. I think Mark had
>>> an updated version to address that, so it would be easier not to have
>>> an old version sitting in some other queue!
>> I am not formally queueing them anywhere (like linux-next), though
>> now that you mention it, that's a pretty good idea - I'll talk to Steven
>> about that, assuming this merge window closes without the task
>> isolation stuff going in.
> NAK. Given the controversy, no way should this stuff go outside the primary trees
> it affects: the scheduler, timer, irq, etc. trees.

Fair enough.  I'll plan to do v10 once the merge window closes.

Mark, let me know when/if you get a new version of the de-asm stuff
for do_notify_resume() - thanks.  Or, would it be helpful if I worked up
the option I suggested, where we check the thread_info flags in the
assembly code before calling out to the new loop in do_notify_resume()?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34   ` Chris Metcalf
  (?)
@ 2016-01-19 15:42   ` Frederic Weisbecker
  2016-01-19 20:45       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 15:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..68a9f7457bc0
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,105 @@
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation for task isolation.
> + *
> + *  Distributed under GPLv2.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include "time/tick-sched.h"
> +
> +cpumask_var_t task_isolation_map;
> +
> +/*
> + * Isolation requires both nohz and isolcpus support from the scheduler.
> + * We provide a boot flag that enables both for now, and which we can
> + * add other functionality to over time if needed.  Note that just
> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> + */
> +static int __init task_isolation_setup(char *str)
> +{
> +	alloc_bootmem_cpumask_var(&task_isolation_map);
> +	if (cpulist_parse(str, task_isolation_map) < 0) {
> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> +		return 1;
> +	}
> +
> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
> +
> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> +	tick_nohz_full_running = true;

How about calling tick_nohz_full_setup() instead? I'd prefer that
nohz full implementation details stay in tick-sched.c.
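
Something like this in isolation.c, as a sketch, assuming
tick_nohz_full_setup() is made non-static and given a cpumask-taking
form (the exact signature would be whatever tick-sched.c wants to
export):

        static int __init task_isolation_setup(char *str)
        {
                alloc_bootmem_cpumask_var(&task_isolation_map);
                if (cpulist_parse(str, task_isolation_map) < 0) {
                        pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
                        return 1;
                }

                alloc_bootmem_cpumask_var(&cpu_isolated_map);
                cpumask_copy(cpu_isolated_map, task_isolation_map);

                /* tick-sched.c keeps ownership of tick_nohz_full_mask
                 * and tick_nohz_full_running.
                 */
                tick_nohz_full_setup(task_isolation_map);
                return 1;
        }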

Also, what happens if nohz_full= is given as well as task_isolation=?
Don't we risk a memory leak, and maybe breaking the requirement that
(nohz_full & task_isolation) == task_isolation?

> +
> +	return 1;
> +}
> +__setup("task_isolation=", task_isolation_setup);
> +
> +/*
> + * This routine controls whether we can enable task-isolation mode.
> + * The task must be affinitized to a single task_isolation core or we will
> + * return EINVAL.  Although the application could later re-affinitize
> + * to a housekeeping core and lose task isolation semantics, this
> + * initial test should catch 99% of bugs with task placement prior to
> + * enabling task isolation.
> + */
> +int task_isolation_set(unsigned int flags)
> +{
> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> +	    !task_isolation_possible(smp_processor_id()))
> +		return -EINVAL;
> +
> +	current->task_isolation_flags = flags;
> +	return 0;
> +}

What if we concurrently change the task's affinity? Also it seems that preemption
isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
seen warnings with smp_processor_id().

Also we should protect against task's affinity change when task_isolation_flags
is set.

> +
> +/*
> + * In task isolation mode we try to return to userspace only after
> + * attempting to make sure we won't be interrupted again.  To handle
> + * the periodic scheduler tick, we test to make sure that the tick is
> + * stopped, and if it isn't yet, we request a reschedule so that if
> + * another task needs to run to completion first, it can do so.
> + * Similarly, if any other subsystems require quiescing, we will need
> + * to do that before we return to userspace.
> + */
> +bool _task_isolation_ready(void)
> +{
> +	WARN_ON_ONCE(!irqs_disabled());
> +
> +	/* If we need to drain the LRU cache, we're not ready. */
> +	if (lru_add_drain_needed(smp_processor_id()))
> +		return false;
> +
> +	/* If vmstats need updating, we're not ready. */
> +	if (!vmstat_idle())
> +		return false;
> +
> +	/* Request rescheduling unless we are in full dynticks mode. */
> +	if (!tick_nohz_tick_stopped()) {
> +		set_tsk_need_resched(current);

I'm not sure doing this will help getting the tick to get stopped.

> +		return false;
> +	}
> +
> +	return true;
> +}

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 15:42   ` Frederic Weisbecker
@ 2016-01-19 20:45       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-19 20:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
>> diff --git a/kernel/isolation.c b/kernel/isolation.c
>> new file mode 100644
>> index 000000000000..68a9f7457bc0
>> --- /dev/null
>> +++ b/kernel/isolation.c
>> @@ -0,0 +1,105 @@
>> +/*
>> + *  linux/kernel/isolation.c
>> + *
>> + *  Implementation for task isolation.
>> + *
>> + *  Distributed under GPLv2.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/swap.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/isolation.h>
>> +#include <linux/syscalls.h>
>> +#include "time/tick-sched.h"
>> +
>> +cpumask_var_t task_isolation_map;
>> +
>> +/*
>> + * Isolation requires both nohz and isolcpus support from the scheduler.
>> + * We provide a boot flag that enables both for now, and which we can
>> + * add other functionality to over time if needed.  Note that just
>> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
>> + */
>> +static int __init task_isolation_setup(char *str)
>> +{
>> +	alloc_bootmem_cpumask_var(&task_isolation_map);
>> +	if (cpulist_parse(str, task_isolation_map) < 0) {
>> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
>> +		return 1;
>> +	}
>> +
>> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
>> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
>> +
>> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
>> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
>> +	tick_nohz_full_running = true;
> How about calling tick_nohz_full_setup() instead? I'd prefer that
> nohz full implementation details stay in tick-sched.c.
>
> Also, what happens if nohz_full= is given as well as task_isolation=?
> Don't we risk a memory leak, and maybe breaking the requirement that
> (nohz_full & task_isolation) == task_isolation?

Yeah, this is a good point.  I'm not sure what the best way is to make
this happen.  It's already true that we will leak memory if you
specify "nohz_full=" more than once on the command line, but it's
awkward to fix (assuming we want the last value to win) so maybe
we can just ignore this problem - it's a pretty small amount of memory
after all.  If so, then making tick_nohz_full_setup() and
isolated_cpu_setup() both non-static and calling them from
task_isolation_setup() might be the cleanest approach.  What do you think?

You asked what happens if nohz_full= is given as well, which is a very
good question.  Perhaps the right answer is to have an early_initcall
that suppresses task isolation on any cores that lost their nohz_full
or isolcpus status due to later boot command line arguments (and
generate a console warning, obviously).
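
Roughly this kind of thing, as a sketch (the helper name and the exact
checks are invented here):

        /* Late in boot, after all command-line parsing: drop (and warn
         * about) any task-isolation cpu that a later nohz_full= or
         * isolcpus= argument pushed out of the required masks.
         */
        static int __init task_isolation_check(void)
        {
                int cpu;

                for_each_cpu(cpu, task_isolation_map) {
                        if (!tick_nohz_full_cpu(cpu) ||
                            !cpumask_test_cpu(cpu, cpu_isolated_map)) {
                                pr_warn("task_isolation: disabling on cpu %d\n",
                                        cpu);
                                cpumask_clear_cpu(cpu, task_isolation_map);
                        }
                }
                return 0;
        }
        early_initcall(task_isolation_check);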

>> +
>> +	return 1;
>> +}
>> +__setup("task_isolation=", task_isolation_setup);
>> +
>> +/*
>> + * This routine controls whether we can enable task-isolation mode.
>> + * The task must be affinitized to a single task_isolation core or we will
>> + * return EINVAL.  Although the application could later re-affinitize
>> + * to a housekeeping core and lose task isolation semantics, this
>> + * initial test should catch 99% of bugs with task placement prior to
>> + * enabling task isolation.
>> + */
>> +int task_isolation_set(unsigned int flags)
>> +{
>> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
>> +	    !task_isolation_possible(smp_processor_id()))
>> +		return -EINVAL;
>> +
>> +	current->task_isolation_flags = flags;
>> +	return 0;
>> +}
> What if we concurrently change the task's affinity? Also it seems that preemption
> isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> seen warnings with smp_processor_id().
>
> Also we should protect against task's affinity change when task_isolation_flags
> is set.

I talked about this a bit when you raised it for the v8 patch series:

   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com

I'd be curious to hear your take on the arguments I made there.

You're absolutely right about the preemption warnings, which I only fixed
a few days ago.  In this case I use raw_smp_processor_id() since with a
fixed single-core cpu affinity, we're not going anywhere, so the warning
from smp_processor_id() would be bogus.  And although technically it is
still correct (racing with another task resetting the task affinity on this
one), it is in any case equivalent to having that other task reset the
affinity on return from the prctl(), which I've already claimed isn't an
interesting use case to try to handle.  But let me know what you think!

>> +
>> +/*
>> + * In task isolation mode we try to return to userspace only after
>> + * attempting to make sure we won't be interrupted again.  To handle
>> + * the periodic scheduler tick, we test to make sure that the tick is
>> + * stopped, and if it isn't yet, we request a reschedule so that if
>> + * another task needs to run to completion first, it can do so.
>> + * Similarly, if any other subsystems require quiescing, we will need
>> + * to do that before we return to userspace.
>> + */
>> +bool _task_isolation_ready(void)
>> +{
>> +	WARN_ON_ONCE(!irqs_disabled());
>> +
>> +	/* If we need to drain the LRU cache, we're not ready. */
>> +	if (lru_add_drain_needed(smp_processor_id()))
>> +		return false;
>> +
>> +	/* If vmstats need updating, we're not ready. */
>> +	if (!vmstat_idle())
>> +		return false;
>> +
>> +	/* Request rescheduling unless we are in full dynticks mode. */
>> +	if (!tick_nohz_tick_stopped()) {
>> +		set_tsk_need_resched(current);
> I'm not sure doing this will help getting the tick to get stopped.

Well, I don't know that there is anything else we CAN do, right?  If there's
another task that can run, great - it may be that that's why full dynticks
isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
there's nothing else we can do, in which case we basically spend our time
going around through the scheduler code and back out to the
task_isolation_ready() test, but again, there's really nothing else more
useful we can be doing at this point.  Once the RCU tick fires (or whatever
it was that was preventing full dynticks from engaging), we will pass this
test and return to user space.
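
For reference, the shape of the exit-to-usermode loop in question, as a
heavily simplified sketch (do_signal() stands in for the arch's signal
hook, and the work mask is abbreviated); setting TIF_NEED_RESCHED just
makes the next pass around this loop call schedule():

        static void prepare_exit_to_usermode(struct pt_regs *regs)
        {
                struct thread_info *ti = current_thread_info();

                while (true) {
                        unsigned long flags = READ_ONCE(ti->flags);

                        if (flags & _TIF_NEED_RESCHED)
                                schedule();
                        if (flags & _TIF_SIGPENDING)
                                do_signal(regs);

                        /* Leave only when no work bits are set and the
                         * core is quiesced: tick stopped, lru drained,
                         * vmstat idle.
                         */
                        if (!(READ_ONCE(ti->flags) & _TIF_WORK_MASK) &&
                            task_isolation_ready())
                                break;
                }
        }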

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 21:19           ` Chris Metcalf
  (?)
@ 2016-01-20 13:27           ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-20 13:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Ingo Molnar, Will Deacon, Gilad Ben Yossef, Steven Rostedt,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

Hi Chris,

Sorry for the delay. I had intended to take a look at this and so held
off replying, but my time has been taken up elsewhere.

On Wed, Jan 13, 2016 at 04:19:56PM -0500, Chris Metcalf wrote:
> On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> >* Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >>(Adding Mark to cc's)
> >>
> >>On 01/12/2016 05:07 AM, Will Deacon wrote:
> >>>On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>>>Ping!  There has been no substantive feedback to this version of
> >>>>the patch in the week since I posted it, which optimistically suggests
> >>>>to me that people may be satisfied with it.  If that's true, Frederic,
> >>>>I assume this would be pulled into your tree?
> >>>>
> >>>>I have slightly updated the v9 patch series since this posting:
> >>>>
> >>>>[...]
> >>>>
> >>>>- Incorporated Mark Rutland's changes to convert arm64
> >>>>   assembly to C code instead of using my own version.
> >>>Please avoid queuing these patches -- the first is already in the arm64
> >>>queue for 4.5 and the second was found to introduce a substantial
> >>>performance regression on the syscall entry/exit path. I think Mark had
> >>>an updated version to address that, so it would be easier not to have
> >>>an old version sitting in some other queue!
> >>I am not formally queueing them anywhere (like linux-next), though
> >>now that you mention it, that's a pretty good idea - I'll talk to Steven
> >>about that, assuming this merge window closes without the task
> >>isolation stuff going in.
> >NAK. Given the controversy, no way should this stuff go outside the primary trees
> >it affects: the scheduler, timer, irq, etc. trees.
> 
> Fair enough.  I'll plan to do v10 once the merge window closes.
> 
> Mark, let me know when/if you get a new version of the de-asm stuff
> for do_notify_resume() - thanks.

If I get the chance soon, I will do, though I suspect I won't be able
to give that the time it deserves over the next week or two.

> Or, would it be helpful if I worked up the option I suggested, where
> we check the thread_info flags in the assembly code before calling out
> to the new loop in do_notify_resume()?

That would probably be for the best.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 20:45       ` Chris Metcalf
  (?)
@ 2016-01-28  0:28       ` Frederic Weisbecker
  2016-01-29 18:18           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-28  0:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> >>+/*
> >>+ * Isolation requires both nohz and isolcpus support from the scheduler.
> >>+ * We provide a boot flag that enables both for now, and which we can
> >>+ * add other functionality to over time if needed.  Note that just
> >>+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> >>+ */
> >>+static int __init task_isolation_setup(char *str)
> >>+{
> >>+	alloc_bootmem_cpumask_var(&task_isolation_map);
> >>+	if (cpulist_parse(str, task_isolation_map) < 0) {
> >>+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> >>+		return 1;
> >>+	}
> >>+
> >>+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> >>+	cpumask_copy(cpu_isolated_map, task_isolation_map);
> >>+
> >>+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> >>+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> >>+	tick_nohz_full_running = true;
> >How about calling tick_nohz_full_setup() instead? I'd prefer that
> >nohz full implementation details stay in tick-sched.c.
> >
> >Also, what happens if nohz_full= is given as well as task_isolation=?
> >Don't we risk a memory leak, and maybe breaking the requirement that
> >(nohz_full & task_isolation) == task_isolation?
> 
> Yeah, this is a good point.  I'm not sure what the best way is to make
> this happen.  It's already true that we will leak memory if you
> specify "nohz_full=" more than once on the command line, but it's
> awkward to fix (assuming we want the last value to win) so maybe
> we can just ignore this problem - it's a pretty small amount of memory
> after all.  If so, then making tick_nohz_full_setup() and
> isolated_cpu_setup()
> both non-static and calling them from task_isolation_setup() might
> be the cleanest approach.  What do you think?

I think we can reuse tick_nohz_full_setup() indeed, or some of its internals
and encapsulate that in a function so that isolation.c can initialize nohz full
without fiddling with internal variables.

> 
> You asked what happens if nohz_full= is given as well, which is a very
> good question.  Perhaps the right answer is to have an early_initcall
> that suppresses task isolation on any cores that lost their nohz_full
> or isolcpus status due to later boot command line arguments (and
> generate a console warning, obviously).

I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
That's the easiest way to deal with it, and both nohz and task isolation can call
a common initializer that takes care of the allocation and adds the cpus to the mask.
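
For instance, as a sketch (the name is invented here, and the NULL test
assumes CPUMASK_OFFSTACK):

        /* One shared initializer: both "nohz_full=" and
         * "task_isolation=" call this, so the final mask is the union
         * of the two and is only allocated once.
         */
        void __init tick_nohz_full_add_cpus(const struct cpumask *cpumask)
        {
                if (cpumask_empty(cpumask))
                        return;

                if (tick_nohz_full_mask == NULL)
                        alloc_bootmem_cpumask_var(&tick_nohz_full_mask);

                cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, cpumask);
                tick_nohz_full_running = true;
        }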

> >>+int task_isolation_set(unsigned int flags)
> >>+{
> >>+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> >>+	    !task_isolation_possible(smp_processor_id()))
> >>+		return -EINVAL;
> >>+
> >>+	current->task_isolation_flags = flags;
> >>+	return 0;
> >>+}
> >What if we concurrently change the task's affinity? Also it seems that preemption
> >isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> >seen warnings with smp_processor_id().
> >
> >Also we should protect against task's affinity change when task_isolation_flags
> >is set.
> 
> I talked about this a bit when you raised it for the v8 patch series:
> 
>   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com
> 
> I'd be curious to hear your take on the arguments I made there.

Oh ok, I'm going to reply there then :)

> 
> You're absolutely right about the preemption warnings, which I only fixed
> a few days ago.  In this case I use raw_smp_processor_id() since with a
> fixed single-core cpu affinity, we're not going anywhere, so the warning
> from smp_processor_id() would be bogus.  And although technically it is
> still correct (racing with another task resetting the task affinity on this
> one), it is in any case equivalent to having that other task reset the
> affinity
> on return from the prctl(), which I've already claimed isn't an interesting
> use case to try to handle.  But let me know what you think!

Ok it's very much tied to the affinity issue. If we deal with affinity changes
properly I think we can use the raw_ version.

> 
> >>+
> >>+/*
> >>+ * In task isolation mode we try to return to userspace only after
> >>+ * attempting to make sure we won't be interrupted again.  To handle
> >>+ * the periodic scheduler tick, we test to make sure that the tick is
> >>+ * stopped, and if it isn't yet, we request a reschedule so that if
> >>+ * another task needs to run to completion first, it can do so.
> >>+ * Similarly, if any other subsystems require quiescing, we will need
> >>+ * to do that before we return to userspace.
> >>+ */
> >>+bool _task_isolation_ready(void)
> >>+{
> >>+	WARN_ON_ONCE(!irqs_disabled());
> >>+
> >>+	/* If we need to drain the LRU cache, we're not ready. */
> >>+	if (lru_add_drain_needed(smp_processor_id()))
> >>+		return false;
> >>+
> >>+	/* If vmstats need updating, we're not ready. */
> >>+	if (!vmstat_idle())
> >>+		return false;
> >>+
> >>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>+	if (!tick_nohz_tick_stopped()) {
> >>+		set_tsk_need_resched(current);
> >I'm not sure doing this will help getting the tick to get stopped.
> 
> Well, I don't know that there is anything else we CAN do, right?  If there's
> another task that can run, great - it may be that that's why full dynticks
> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> there's nothing else we can do, in which case we basically spend our time
> going around through the scheduler code and back out to the
> task_isolation_ready() test, but again, there's really nothing else more
> useful we can be doing at this point.  Once the RCU tick fires (or whatever
> it was that was preventing full dynticks from engaging), we will pass this
> test and return to user space.

There is nothing at all you can do and setting TIF_RESCHED won't help either.
If there is another task that can run, the scheduler takes care of resched
by itself :-)

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-28  0:28       ` Frederic Weisbecker
@ 2016-01-29 18:18           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-29 18:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>> You asked what happens if nohz_full= is given as well, which is a very
>> good question.  Perhaps the right answer is to have an early_initcall
>> that suppresses task isolation on any cores that lost their nohz_full
>> or isolcpus status due to later boot command line arguments (and
>> generate a console warning, obviously).
> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
> That's the easiest way to deal with it, and both nohz and task isolation can call
> a common initializer that takes care of the allocation and adds the cpus to the mask.

I like it!

And by the same token, the final isolcpus cpumask is
"isolcpus=" | "task_isolation="?  That seems like we'd want to do it
to keep things parallel.

>>>> +bool _task_isolation_ready(void)
>>>> +{
>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>> +
>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>> +		return false;
>>>> +
>>>> +	/* If vmstats need updating, we're not ready. */
>>>> +	if (!vmstat_idle())
>>>> +		return false;
>>>> +
>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>> +	if (!tick_nohz_tick_stopped()) {
>>>> +		set_tsk_need_resched(current);
>>> I'm not sure doing this will help getting the tick to get stopped.
>> Well, I don't know that there is anything else we CAN do, right?  If there's
>> another task that can run, great - it may be that that's why full dynticks
>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>> there's nothing else we can do, in which case we basically spend our time
>> going around through the scheduler code and back out to the
>> task_isolation_ready() test, but again, there's really nothing else more
>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>> it was that was preventing full dynticks from engaging), we will pass this
>> test and return to user space.
> There is nothing at all you can do and setting TIF_RESCHED won't help either.
> If there is another task that can run, the scheduler takes care of resched
> by itself :-)

The problem is that the scheduler will only take care of resched at a
later time, typically when we get a timer interrupt later.  By invoking the
scheduler here, we allow any tasks that are ready to run to run
immediately, rather than waiting for an interrupt to wake the scheduler.
Plenty of places in the kernel just call schedule() directly when they are
waiting.  Since we're waiting here regardless, we might as well
immediately get any other runnable tasks dealt with.

We could also just return "false" in _task_isolation_ready(), and then
check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
call schedule() explicitly there, but that seems a little more roundabout.
Admittedly it's more usual to see kernel code call schedule() directly
to yield the processor, but in this case I'm not convinced it's cleaner
given we're already in a loop where the caller is checking TIF_RESCHED
and then calling schedule() when it's set.
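
That roundabout variant would look something like this sketch:

        /* Do the quiescing work in _task_isolation_enter() and yield
         * directly when the tick is still running, instead of setting
         * TIF_NEED_RESCHED and letting the caller's loop do it.
         */
        void _task_isolation_enter(void)
        {
                lru_add_drain();
                quiet_vmstat();
                if (!tick_nohz_tick_stopped())
                        schedule();
        }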

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-01-30 21:11             ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-30 21:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> >On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> >>You asked what happens if nohz_full= is given as well, which is a very
> >>good question.  Perhaps the right answer is to have an early_initcall
> >>that suppresses task isolation on any cores that lost their nohz_full
> >>or isolcpus status due to later boot command line arguments (and
> >>generate a console warning, obviously).
> >I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
> >That's the easiest way to deal with it, and both nohz and task isolation can call
> >a common initializer that takes care of the allocation and adds the cpus to the mask.
> 
> I like it!
> 
> And by the same token, the final isolcpus cpumask is "isolcpus=" |
> "task_isolation="?
> That seems like we'd want to do it to keep things parallel.

We have reverted the patch that made isolcpus |= nohz_full.  Too
many people complained about unusable machines with NO_HZ_FULL_ALL.

But the user can still set that parameter manually.

> 
> >>>>+bool _task_isolation_ready(void)
> >>>>+{
> >>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>+
> >>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>+		return false;
> >>>>+
> >>>>+	/* If vmstats need updating, we're not ready. */
> >>>>+	if (!vmstat_idle())
> >>>>+		return false;
> >>>>+
> >>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>+		set_tsk_need_resched(current);
> >>>I'm not sure doing this will help getting the tick to get stopped.
> >>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>another task that can run, great - it may be that that's why full dynticks
> >>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>there's nothing else we can do, in which case we basically spend our time
> >>going around through the scheduler code and back out to the
> >>task_isolation_ready() test, but again, there's really nothing else more
> >>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>it was that was preventing full dynticks from engaging), we will pass this
> >>test and return to user space.
> >There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >If there is another task that can run, the scheduler takes care of resched
> >by itself :-)
> 
> The problem is that the scheduler will only take care of resched at a
> later time, typically when we get a timer interrupt later.

When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
target is remote it sends an IPI, if it's local then we wait the next reschedule
point (preemption points, voluntary reschedule, interrupts). There is just nothing
you can do to accelerate that.
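
This is essentially resched_curr() in kernel/sched/core.c, trimmed down:

        /* Mark the target's TIF_NEED_RESCHED; only send the scheduler
         * IPI when the target cpu is remote and not already polling
         * the flag.
         */
        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
        } else if (set_nr_and_not_polling(curr)) {
                smp_send_reschedule(cpu);
        }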


> By invoking the scheduler here, we allow any tasks that are ready to run to run
> immediately, rather than waiting for an interrupt to wake the scheduler.

Well, in this case here we are interested in the current CPU. And if a task
got awoken and waits for the current CPU, it will have an opportunity to get
scheduled on syscall exit.

> Plenty of places in the kernel just call schedule() directly when they are
> waiting.  Since we're waiting here regardless, we might as well
> immediately get any other runnable tasks dealt with.
> 
> We could also just return "false" in _task_isolation_ready(), and then
> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> call schedule() explicitly there, but that seems a little more roundabout.
> Admittedly it's more usual to see kernel code call schedule() directly
> to yield the processor, but in this case I'm not convinced it's cleaner
> given we're already in a loop where the caller is checking TIF_RESCHED
> and then calling schedule() when it's set.

You could call cond_resched(), but really syscall exit is enough for what
you want. And the problem here, if a task prevents the CPU from stopping the
tick, is that task itself, not the fact that it doesn't get scheduled. If we have
tasks other than the current isolated one on the CPU, it means that the
environment is not ready for hard isolation.

And in general we shouldn't loop at all there: if something depends on the tick,
the CPU is not ready for isolation and something needs to be done first (setting
some task affinity, etc.). So we should just fail the prctl and let the user
deal with it.
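
In userspace that would look something like this (a sketch only:
PR_SET_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE are the names this
series proposes, and the -EAGAIN/-EBUSY retry is an assumption about how
the failure would be reported):

    #include <sys/prctl.h>
    #include <errno.h>
    #include <unistd.h>

    static void enter_isolation(void)
    {
        /* Retry until the kernel agrees the cpu is quiesced. */
        while (prctl(PR_SET_TASK_ISOLATION,
                     PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
            if (errno != EAGAIN && errno != EBUSY)
                break;          /* hard failure: fix affinity etc. */
            usleep(1000);       /* give pending work a chance to drain */
        }
    }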

> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-30 21:11             ` Frederic Weisbecker
@ 2016-02-11 19:24               ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-02-11 19:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
>> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
>>> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>>>> You asked what happens if nohz_full= is given as well, which is a very
>>>> good question.  Perhaps the right answer is to have an early_initcall
>>>> that suppresses task isolation on any cores that lost their nohz_full
>>>> or isolcpus status due to later boot command line arguments (and
>>>> generate a console warning, obviously).
>>> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
>>> That's the easiest way to deal with and both nohz and task isolation can call
>>> a common initializer that takes care of the allocation and add the cpus to the mask.
>> I like it!
>>
>> And by the same token, the final isolcpus cpumask is "isolcpus=" |
>> "task_isolation="?
>> That seems like we'd want to do it to keep things parallel.
> We have reverted the patch that made isolcpus |= nohz_full. Too
> many people complained about unusable machines with NO_HZ_FULL_ALL.
>
> But the user can still set that parameter manually.

Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
we should add cpus X-Y to both the nohz_full set and the isolcpus set.
I've changed it to work that way for the v10 patch series.


>>>>>> +bool _task_isolation_ready(void)
>>>>>> +{
>>>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>>>> +
>>>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* If vmstats need updating, we're not ready. */
>>>>>> +	if (!vmstat_idle())
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>>>> +	if (!tick_nohz_tick_stopped()) {
>>>>>> +		set_tsk_need_resched(current);
>>>>> I'm not sure doing this will help getting the tick to get stopped.
>>>> Well, I don't know that there is anything else we CAN do, right?  If there's
>>>> another task that can run, great - it may be that that's why full dynticks
>>>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>>>> there's nothing else we can do, in which case we basically spend our time
>>>> going around through the scheduler code and back out to the
>>>> task_isolation_ready() test, but again, there's really nothing else more
>>>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>>>> it was that was preventing full dynticks from engaging), we will pass this
>>>> test and return to user space.
>>> There is nothing at all you can do and setting TIF_RESCHED won't help either.
>>> If there is another task that can run, the scheduler takes care of resched
>>> by itself :-)
>> The problem is that the scheduler will only take care of resched at a
>> later time, typically when we get a timer interrupt later.
> When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> target is remote it sends an IPI; if it's local, we wait for the next reschedule
> point (preemption points, voluntary reschedules, interrupts). There is just nothing
> you can do to accelerate that.

But that's exactly what I'm saying.  If we're sitting in a loop here waiting
for some short-lived process (maybe kernel thread) to run and get out of
the way, we don't want to just spin sitting in prepare_exit_to_usermode().
We want to call schedule(), get the short-lived process to run, then when
it calls schedule() again, we're back in prepare_exit_to_usermode but now
we can return to userspace.

We don't want to wait for preemption points or interrupts, and there are
no other voluntary reschedules in the prepare_exit_to_usermode() loop.

If the other task had been woken up for some completion, then yes we would
already have had TIF_RESCHED set, but if the other runnable task was (for
example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
this point, and thus we might need to call schedule() explicitly.

Note that the prepare_exit_to_usermode() loop is exactly the point at
which we normally call schedule() if we are in syscall exit, so we are
just encouraging that schedule() to happen if otherwise it might not.
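
For context, that loop looks roughly like this in the 4.4-era
arch/x86/entry/common.c (condensed here, not quoted verbatim):

    static void prepare_exit_to_usermode(struct pt_regs *regs)
    {
        while (true) {
            u32 cached_flags = READ_ONCE(current_thread_info()->flags);

            /* Nothing pending?  Then we can return to userspace. */
            if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                                  _TIF_UPROBE | _TIF_NEED_RESCHED)))
                break;

            local_irq_enable();

            if (cached_flags & _TIF_NEED_RESCHED)
                schedule();     /* the schedule() call at issue here */

            /* ... uprobes, signal delivery, notify-resume work ... */

            local_irq_disable();
        }
    }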

>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>> immediately, rather than waiting for an interrupt to wake the scheduler.
> Well, in this case here we are interested in the current CPU. And if a task
> got awoken and waits for the current CPU, it will have an opportunity to get
> scheduled on syscall exit.

That's true if TIF_RESCHED was set because a completion occurred that
the other task was waiting for.  But there might not be any such completion
and the task just got preempted earlier and is still ready to run.

My point is that setting TIF_RESCHED is never harmful, and there are
cases like involuntary preemption where it might help.


>> Plenty of places in the kernel just call schedule() directly when they are
>> waiting.  Since we're waiting here regardless, we might as well
>> immediately get any other runnable tasks dealt with.
>>
>> We could also just return "false" in _task_isolation_ready(), and then
>> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
>> call schedule() explicitly there, but that seems a little more roundabout.
>> Admittedly it's more usual to see kernel code call schedule() directly
>> to yield the processor, but in this case I'm not convinced it's cleaner
>> given we're already in a loop where the caller is checking TIF_RESCHED
>> and then calling schedule() when it's set.
> You could call cond_resched(), but really syscall exit is enough for what
> you want. And the problem here, if a task prevents the CPU from stopping the
> tick, is that task itself, not the fact that it doesn't get scheduled.

True, although in that case we just need to wait (e.g. for an RCU tick
to occur to quiesce); we could spin, but spinning through the scheduler
seems no better or worse in that case than just spinning with
interrupts enabled in a loop.  And (as I said above) it could help.

> If we have
> tasks other than the current isolated one on the CPU, it means that the
> environment is not ready for hard isolation.

Right.  But the model is that in that case, the task that wants hard
isolation is just going to have to wait to return to userspace.


> And in general we shouldn't loop at all there: if something depends on the tick,
> the CPU is not ready for isolation and something needs to be done first (setting
> some task affinity, etc.). So we should just fail the prctl and let the user
> deal with it.

So there are potentially two cases here:

(1) When we initially do the prctl(), should we check to see if there are
other schedulable tasks, etc., and fail the prctl() if so?  You could make a
case for this, but I think in practice userspace would just end up looping
back to retry the prctl if we created that semantic in the kernel.

(2) What about times when we are leaving the kernel after already
doing the prctl()?  For example a core doing packet forwarding might
want to report some error condition up to the kernel, and remove itself
from the set of cores handling packets, then do some syscall(s) to generate
logging data, and then go back and continue handling packets.  Or, the
process might have created some large anonymous mapping where
every now and then it needs to cross a page boundary for some structure
and touch a new page, and it knows to expect a page fault in that case.
In those cases we are returning from the kernel, not at prctl() time, and
we still want to enforce the semantics that no further interrupts will
occur to disturb the task.  These kinds of use cases are why we have
as general-purpose a mechanism as we do for task isolation.
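
To make case (2) concrete, the usage pattern is something like the
following sketch (handle_packets(), log_fd, err_msg, etc. are
hypothetical application pieces; the prctl names are the ones this
series proposes):

    /* One-time setup, after pinning to an isolated core. */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

    for (;;) {
        if (handle_packets() < 0) {     /* hypothetical */
            /* Deliberate kernel entry: on return, the kernel holds
             * us until the core quiesces, then we come back out
             * still in task isolation mode. */
            write(log_fd, err_msg, err_len);
        }
    }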

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-02-11 19:24               ` Chris Metcalf
@ 2016-03-04 12:56               ` Frederic Weisbecker
  2016-03-09 19:39                   ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-03-04 12:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
> On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> >We have reverted the patch that made isolcpus |= nohz_full. Too
> >many people complained about unusable machines with NO_HZ_FULL_ALL.
> >
> >But the user can still set that parameter manually.
> 
> Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
> we should add cpus X-Y to both the nohz_full set and the isolcpus set.
> I've changed it to work that way for the v10 patch series.

Ok.

> 
> 
> >>>>>>+bool _task_isolation_ready(void)
> >>>>>>+{
> >>>>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>>>+
> >>>>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* If vmstats need updating, we're not ready. */
> >>>>>>+	if (!vmstat_idle())
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>>>+		set_tsk_need_resched(current);
> >>>>>I'm not sure doing this will help getting the tick to get stopped.
> >>>>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>>>another task that can run, great - it may be that that's why full dynticks
> >>>>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>>>there's nothing else we can do, in which case we basically spend our time
> >>>>going around through the scheduler code and back out to the
> >>>>task_isolation_ready() test, but again, there's really nothing else more
> >>>>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>>>it was that was preventing full dynticks from engaging), we will pass this
> >>>>test and return to user space.
> >>>There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >>>If there is another task that can run, the scheduler takes care of resched
> >>>by itself :-)
> >>The problem is that the scheduler will only take care of resched at a
> >>later time, typically when we get a timer interrupt later.
> >When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> >target is remote it sends an IPI; if it's local, we wait for the next reschedule
> >point (preemption points, voluntary reschedules, interrupts). There is just nothing
> >you can do to accelerate that.
> 
> But that's exactly what I'm saying.  If we're sitting in a loop here waiting
> for some short-lived process (maybe kernel thread) to run and get out of
> the way, we don't want to just spin sitting in prepare_exit_to_usermode().
> We want to call schedule(), get the short-lived process to run, then when
> it calls schedule() again, we're back in prepare_exit_to_usermode but now
> we can return to userspace.

Maybe, although I think returning to userspace with -EAGAIN or -EBUSY or something
like that would be better, so that userspace retries a bit later with prctl.
Otherwise we may well be waiting forever in kernel mode.

> 
> We don't want to wait for preemption points or interrupts, and there are
> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
> 
> If the other task had been woken up for some completion, then yes we would
> already have had TIF_RESCHED set, but if the other runnable task was (for
> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
> this point, and thus we might need to call schedule() explicitly.

There can't be another task in the runqueue waiting to be preempted since
we (the current task) are running on the CPU.

Besides, if we aren't alone in the runqueue, this breaks the task isolation
mode.

> 
> Note that the prepare_exit_to_usermode() loop is exactly the point at
> which we normally call schedule() if we are in syscall exit, so we are
> just encouraging that schedule() to happen if otherwise it might not.
> 
> >>By invoking the scheduler here, we allow any tasks that are ready to run to run
> >>immediately, rather than waiting for an interrupt to wake the scheduler.
> >Well, in this case here we are interested in the current CPU. And if a task
> >got awoken and waits for the current CPU, it will have an opportunity to get
> >scheduled on syscall exit.
> 
> That's true if TIF_RESCHED was set because a completion occurred that
> the other task was waiting for.  But there might not be any such completion
> and the task just got preempted earlier and is still ready to run.

But if another task waits for the CPU, this breaks task isolation mode. Now,
assuming we want a pending task to resume so that we get the CPU to ourselves,
we have no idea if the scheduler is going to schedule that task; it depends on
vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
doesn't guarantee any context switch.

> My point is that setting TIF_RESCHED is never harmful, and there are
> cases like involuntary preemption where it might help.

Sure but we don't write code just because it doesn't harm. Strange code hurts
the brain of reviewers.

Now concerning involuntary preemption, it's a matter of a millisecond; userspace
needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
what can be useful, as we leave the CPU for the resuming task.

Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
correctly or there is a kernel thread that might run again at some time ahead.

> 
> >>Plenty of places in the kernel just call schedule() directly when they are
> >>waiting.  Since we're waiting here regardless, we might as well
> >>immediately get any other runnable tasks dealt with.
> >>
> >>We could also just return "false" in _task_isolation_ready(), and then
> >>check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> >>call schedule() explicitly there, but that seems a little more roundabout.
> >>Admittedly it's more usual to see kernel code call schedule() directly
> >>to yield the processor, but in this case I'm not convinced it's cleaner
> >>given we're already in a loop where the caller is checking TIF_RESCHED
> >>and then calling schedule() when it's set.
> >You could call cond_resched(), but really syscall exit is enough for what
> >you want. And the problem here, if a task prevents the CPU from stopping the
> >tick, is that task itself, not the fact that it doesn't get scheduled.
> 
> True, although in that case we just need to wait (e.g. for an RCU tick
> to occur to quiesce); we could spin, but spinning through the scheduler
> seems no better or worse in that case than just spinning with
> interrupts enabled in a loop.  And (as I said above) it could help.

Let's just leave that waiting to userspace. Just sleep a few milliseconds.

> 
> >If we have
> >tasks other than the current isolated one on the CPU, it means that the
> >environment is not ready for hard isolation.
> 
> Right.  But the model is that in that case, the task that wants hard
> isolation is just going to have to wait to return to userspace.

I think we shouldn't do that waiting for isolation in the kernel.

> 
> 
> >And in general we shouldn't loop at all there: if something depends on the tick,
> >the CPU is not ready for isolation and something needs to be done first (setting
> >some task affinity, etc.). So we should just fail the prctl and let the user
> >deal with it.
> 
> So there are potentially two cases here:
> 
> (1) When we initially do the prctl(), should we check to see if there are
> other schedulable tasks, etc., and fail the prctl() if so?  You could make a
> case for this, but I think in practice userspace would just end up looping
> back to retry the prctl if we created that semantic in the kernel.

That sounds saner to me. And if we still fail after one second, then just give up.
In fact, if it doesn't work the first time, that's a bad sign, like I said above:
the task that is running on the CPU may well come back again later. Some
preconditions are not met.

> 
> (2) What about times when we are leaving the kernel after already
> doing the prctl()?  For example a core doing packet forwarding might
> want to report some error condition up to the kernel, and remove itself
> from the set of cores handling packets, then do some syscall(s) to generate
> logging data, and then go back and continue handling packets.  Or, the
> process might have created some large anonymous mapping where
> every now and then it needs to cross a page boundary for some structure
> and touch a new page, and it knows to expect a page fault in that case.
> In those cases we are returning from the kernel, not at prctl() time, and
> we still want to enforce the semantics that no further interrupts will
> occur to disturb the task.  These kinds of use cases are why we have
> as general-purpose a mechanism as we do for task isolation.

If any interrupt or any kind of disturbance happens, we should leave that
task isolation mode and warn the isolated task about that. SIGTERM?

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-04 12:56               ` Frederic Weisbecker
@ 2016-03-09 19:39                   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-03-09 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

Frederic,

Thanks for the detailed feedback on the task isolation stuff.

This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.


   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.

There are a number of pros and cons to the two models.  I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:

PRO for persistent mode: A somewhat easier programming model.  Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously.  For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode.  This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc.  With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running).  By contrast, in
one-shot mode, any return to kernel space turns off task isolation
anyway, so it's very clear what the interaction looks like.  I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc.  Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).
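
The selection might then look like this (sketch; PR_TASK_ISOLATION_ONESHOT
is the proposed new bit, not part of any posted series yet):

    /* One-shot variant: isolation lapses at the next kernel entry
     * (syscall, page fault, etc.) instead of persisting. */
    prctl(PR_SET_TASK_ISOLATION,
          PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ONESHOT, 0, 0, 0);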


   TL;DR: We should be more willing to return -EINVAL from prctl().
   =====

One thing you've argued is that we should be more aggressive about
failing the prctl() call.  I think, in any case, that this is probably
reasonable.  We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series).  This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.

However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_usermode
routine are still required even with your proposed one-shot mode.  We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.
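
Concretely, the prctl-time validation could look something like this
sketch (the names approximate this series rather than quoting it, and
can_stop_full_tick() stands in for whatever its moral equivalent ends up
being):

    int task_isolation_request(void)    /* hypothetical */
    {
        int cpu = smp_processor_id();

        /* Must be affined to exactly one task-isolation cpu. */
        if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
            !task_isolation_possible(cpu))
            return -EINVAL;

        /* Fail if the tick can't stop now, e.g. because another
         * task is schedulable on this cpu. */
        if (!can_stop_full_tick())
            return -EINVAL;

        return 0;
    }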


   TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
   =====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.
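
(The "hook in" alternative would be something like the following, in the
exit-to-usermode path; syscall_set_return_value() is the real interface
here, the other names are hypothetical:)

    if (task_isolation_oneshot(current) &&      /* hypothetical */
        !_task_isolation_ready()) {
        /* Turn the in-flight prctl into an -EBUSY return and
         * stop looping. */
        syscall_set_return_value(current, regs, -EBUSY, 0);
        task_isolation_clear(current);          /* hypothetical */
    }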


   TL;DR: Should we arrange to actually use a completion in
   task_isolation_enter when dynticks are ticking, and call complete()
   in tick-sched.c when we shut down dynticks, or, just spin in
   schedule() and not worry about burning a little cpu?
   =====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...
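
A minimal sketch of that completion-based arrangement, assuming a
per-cpu completion (initialized at boot via init_completion()) and a
hypothetical hook called from tick-sched.c once the tick actually stops:

    static DEFINE_PER_CPU(struct completion, tick_stopped_done);

    /* tick-sched.c, when dynticks finally engages on this cpu: */
    void tick_nohz_notify_stopped(void)         /* hypothetical */
    {
        complete_all(this_cpu_ptr(&tick_stopped_done));
    }

    /* task_isolation_enter(), instead of spinning via schedule(): */
    void task_isolation_wait_for_tick(void)     /* hypothetical */
    {
        wait_for_completion(this_cpu_ptr(&tick_stopped_done));
    }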


   TL;DR: We should turn off task isolation mode for signals.
   =====

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.
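
(Concretely, it's roughly a one-line hook in the signal setup path; both
helper names in this sketch are hypothetical:)

    /* In the arch handle_signal() path: */
    if (task_isolation_enabled(current))        /* hypothetical */
        task_isolation_clear(current);          /* hypothetical */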

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.


   TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
   =====

This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.


   TL;DR: Various minor issues in answer to Frederic's comments :-)
   =====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice.  I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed.  We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> scheduled on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for.  But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode. Now,
> assuming we want a pending task to resume so that we get the CPU to ourselves,
> we have no idea if the scheduler is going to schedule that task; it depends on
> vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
> doesn't guarantee any context switch.

Yes, true.  So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering full dynticks mode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.

Fair enough, and certainly means at a minimum we need a good comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond; userspace
> needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
> what can be useful, as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again.  Then we can context switch back to
our task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()?  For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets.  Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task.  These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode.  By default it uses SIGTERM.  You can
also choose a different signal via the prctl() API.

Thanks again, Frederic!  I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this reply.  I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-03-09 19:39                   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-03-09 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

Frederic,

Thanks for the detailed feedback on the task isolation stuff.

This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.


   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.

There are a number of pros and cons to the two models.  I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:

PRO for persistent mode: A somewhat easier programming model.  Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously.  For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode.  This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc.  With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running).  By contrast, in
one-shot mode, any return to kernel spaces turns off task isolation
anyway, so it's very clear what the interaction looks like.  I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc.  Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).


   TL;DR: We should be more willing to return -EINVAL from prctl().
   =====

One thing you've argued is that we should be more aggressive about
failing the prctl() call.  I think, in any case, that this is probably
reasonable.  We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series).  This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.

However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_userspace
routine are still required even with your proposed one-shot mode.  We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.


   TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
   =====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.


   TL;DR: Should we arrange to actually use a completion in
   task_isolation_enter when dynticks are ticking, and call complete()
   in tick-sched.c when we shut down dynticks, or, just spin in
   schedule() and not worry about burning a little cpu?
   =====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...


   TL;DR: We should turn off task isolation mode for signals.
   =====

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.
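
The hook itself ought to be small; something like this in the arch
signal-delivery path (a sketch; both task_isolation helper names here
are invented for illustration):

    /* Sketch: drop out of task isolation whenever a signal is
     * actually delivered, so the handler runs right away instead
     * of waiting for the kernel to quiesce. */
    static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
    {
            if (task_isolation_enabled(current))
                    task_isolation_clear(current);
            /* ... usual signal frame setup follows ... */
    }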


   TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
   =====

This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.


   TL;DR: Various minor issues in answer to Frederic's comments :-)
   =====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice.  I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed.  We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> schedule on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for.  But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode. Now,
> assuming we want a pending task to resume such that we get the CPU for ourselves,
> we have no idea if the scheduler is going to schedule that task; it depends on
> vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
> doesn't guarantee any context switch.

Yes, true.  So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering full dynticks mode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.

Fair enough, and certainly means at a minimum we need a good comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond; userspace
> needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
> what can be useful as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again.  Then we can context switch back to
our task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()?  For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets.  Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task.  These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode.  By default it uses SIGTERM.  You can
also choose a different signal via the prctl() API.

Thanks again, Frederic!  I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this patch.  I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-09 19:39                   ` Chris Metcalf
  (?)
@ 2016-04-08 13:56                   ` Frederic Weisbecker
  2016-04-08 16:34                       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-04-08 13:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> Frederic,
> 
> Thanks for the detailed feedback on the task isolation stuff.
> 
> This reply kind of turned into an essay, so I've added a little "TL;DR"
> sentence before each section.

I think I'm going to cut my reply into several threads, because really
I can't get myself to make a giant reply all at once :-)

> 
> 
>   TL;DR: Let's make an explicit decision about whether task isolation
>   should be "persistent" or "one-shot".  Both have some advantages.
>   =====
> 
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
> 
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely.  It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers?

> 
> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued.  This is what you recommended in your last
> email.

No, I think we can issue syscalls, for example.  But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.

> 
> There are a number of pros and cons to the two models.  I think on
> balance I still like the "persistent mode" approach, but here's all
> the pros/cons I can think of:
> 
> PRO for persistent mode: A somewhat easier programming model.  Users
> can just imagine "task isolation" as a way for them to still be able
> to use the kernel exactly as they always have; it's just slower to get
> back out of the kernel so you use it judiciously. For example, a
> process is free to call write() on a socket to perform a diagnostic,
> but when returning from the write() syscall, the kernel will hold the
> task in kernel mode until any timer ticks (perhaps from networking
> stuff) are complete, and then let it return to userspace to continue
> in task isolation mode.

So this is not hard isolation anymore. This is rather soft isolation with
best efforts to avoid disturbance.

Surely we can have different levels of isolation.

I'm still wondering what to do if the task migrates to another CPU. In fact,
perhaps what you're trying to do is a CPU property rather than a process property?

> This is convenient to the user since they
> don't have to fret about re-enabling task isolation after that
> syscall, page fault, or whatever; they can just continue running.
> With your suggestion, the user pretty much has to leave STRICT mode
> enabled so he gets notified of any unexpected return to kernel space
> (in fact we might make it required so you always get a signal when
> leaving task isolation unless it's via a prctl or exit syscall).

Right. Although we can allow all syscalls in this mode actually.

> 
> PRO for one-shot mode: A somewhat crisper interaction with
> sched_setaffinity() etc.  With a persistent mode approach, a task can
> start up task isolation, then later another task can be placed on its
> cpu and break it (it won't return to userspace until killed or the new
> process affinitizes itself away or stops running).  By contrast, in
> one-shot mode, any return to kernel spaces turns off task isolation
> anyway, so it's very clear what the interaction looks like.  I suspect
> this is more a theoretical advantage to one-shot mode than a practical
> one, though.

I think I heard about workloads that need such strict hard isolation.
Workloads that really cannot afford any disturbance. They even
use a userspace network stack. Maybe HFT?

> CON for one-shot mode: It's actually hard to catch every kernel entry
> so we can turn the task-isolation flag off again - and we really do
> need to have a flag, just so that we can suitably debug any bad
> actions that bring us into the kernel when we're not expecting it.
> Right now there are things that bring us into the kernel that we don't
> bother annotating for task isolation STRICT mode, just because they're
> visible to the user anyway: e.g., a bus fault or segmentation
> violation.
> 
> I think we can actually make both modes available to users with just
> another flag bit, so maybe we can look at what that looks like in v11:
> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> isolation at the next syscall entry, page fault, etc.  Then we can
> think more specifically about whether we want to remove the flag or
> not, and if we remove it, whether we want to make the code that was
> controlled by it unconditionally true or unconditionally false
> (i.e. remove it again).

I think we shouldn't bother with strict hard isolation if we don't need
it yet. The implementation may well be invasive. Let's wait for someone
who really needs it.

> 
> 
>   TL;DR: We should be more willing to return -EINVAL from prctl().
>   =====
> 
> One thing you've argued is that we should be more aggressive about
> failing the prctl() call.  I think, in any case, that this is probably
> reasonable.  We already check that the task's affinity is limited to
> the current core and that that core is a task_isolation cpu; I think we
> can also require that can_stop_full_tick() return true (or the moral
> equivalent given your recent patch series).  This will mean you can't
> even try to go into task isolation mode if another task is
> schedulable, among other things, which seems like a good thing.
> 
> However, it is important to note that the current task_isolation_ready
> and task_isolation_enter calls that are in the prepare_exit_to_usermode
> routine are still required even with your proposed one-shot mode.  We
> have to be sure that no interrupts occur on the way back to userspace
> that might then in principle lead to timer interrupts being scheduled,
> and the way to do that is to make sure task_isolation_ready returns true
> with interrupts disabled, and that interrupts are not then re-enabled
> before the return to userspace.  Anything else is just keeping your
> fingers crossed and guessing.

So your requirements are actually hard isolation but in userspace?

And what happens if you get interrupted in userspace? What about page
faults and other exceptions?

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 13:56                   ` Frederic Weisbecker
@ 2016-04-08 16:34                       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-08 16:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >   TL;DR: Let's make an explicit decision about whether task isolation
> >   should be "persistent" or "one-shot".  Both have some advantages.
> >   =====
> >
> > An important high-level issue is how "sticky" task isolation mode is.
> > We need to choose one of these two options:
> >
> > "Persistent mode": A task switches state to "task isolation" mode
> > (kind of a level-triggered analogy) and stays there indefinitely.  It
> > can make a syscall, take a page fault, etc., if it wants to, but the
> > kernel protects it from incurring any further asynchronous interrupts.
> > This is the model I've been advocating for.
>
> But then in this mode, what happens when an interrupt triggers?

So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt.  By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.

So for interrupts, the short answer is, it's a bug! :-)

An interrupt could be a kernel bug, in which case we consider it a
"true" bug.  This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
bug.  I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be.  For example, can another core unload a
kernel module without interrupting a task-isolation task?  Not right now.

Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core.  This
causes TLB flush interrupts under application control.  The
application shouldn't do this, and we tell our customers not to build
their applications this way.  The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores.  The
two types of threads just both mmap some common, shared memory but run
as different processes.
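
A minimal sketch of that pattern (illustrative; the path, the
RING_BYTES size, and the ring protocol itself are all up to the
application):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <err.h>

    /* Both the control process and the isolated process map the
     * same shared file; all cross-core communication then goes
     * through this memory rather than through syscalls or
     * TLB-flushing unmaps. */
    int fd = open("/dev/shm/dataplane-ring", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, RING_BYTES) < 0)    /* RING_BYTES: app-defined */
            err(1, "shared ring setup");
    void *ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
            err(1, "mmap");
    /* e.g. run a single-producer/single-consumer ring over "ring" */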

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it.  This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.

If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)

> > "One-shot mode": A task requests isolation via prctl(), the kernel
> > ensures it is isolated on return from the prctl(), but then as soon as
> > it enters the kernel again, task isolation is switched off until
> > another prctl is issued.  This is what you recommended in your last
> > email.
>
> No, I think we can issue syscalls, for example.  But asynchronous interruptions
> such as exceptions (actually somewhat synchronous but can be unexpected) and
> interrupts are what we want to avoid.

Hmm, so I think I'm not really understanding what you are suggesting.

We're certainly in agreement that avoiding interrupts and exceptions
is important.  I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc.  I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.

(Typically the recommendation is to do an mlockall() before starting
task isolation mode, to handle the case of page faults.  But you can
do that and still be screwed by another thread in your process doing a
fork() and then your pages end up read-only for COW and you have to
fault them back in.  But, that's an application bug for a
task-isolation thread, and should just be treated as such.)
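
Putting those recommendations together, a typical setup sequence
looks something like this (a sketch; "isolated_cpu" is whatever
task_isolation core you have chosen, and the PR_* constants come
from the patched <linux/prctl.h>):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <err.h>

    /* Pin to the isolated core, lock memory down to avoid later
     * page faults, then ask for task isolation. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(isolated_cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
            err(1, "sched_setaffinity");
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
            err(1, "mlockall");
    if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE))
            err(1, "prctl(PR_SET_TASK_ISOLATION)");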

> > There are a number of pros and cons to the two models.  I think on
> > balance I still like the "persistent mode" approach, but here's all
> > the pros/cons I can think of:
> >
> > PRO for persistent mode: A somewhat easier programming model.  Users
> > can just imagine "task isolation" as a way for them to still be able
> > to use the kernel exactly as they always have; it's just slower to get
> > back out of the kernel so you use it judiciously. For example, a
> > process is free to call write() on a socket to perform a diagnostic,
> > but when returning from the write() syscall, the kernel will hold the
> > task in kernel mode until any timer ticks (perhaps from networking
> > stuff) are complete, and then let it return to userspace to continue
> > in task isolation mode.
>
> So this is not hard isolation anymore. This is rather soft isolation with
> best efforts to avoid disturbance.

No, it's still hard isolation.  The distinction is that we offer a way
to get in and out of the kernel "safely" if you want to run in that
mode.  The syscalls can take a long time if the syscall ends up
requiring some additional timer ticks to finish sorting out whatever
it was you asked the kernel to do, but once you're back in userspace
you immediately regain "hard" isolation.  It's under program control.

Or, you can enable "strict" mode, and then you get hard isolation
without the ability to get in and out of the kernel at all: the kernel
just kills you if you try to leave hard isolation other than by an
explicit prctl().

> Surely we can have different levels of isolation.

Well, we have nohz_full now, and by adding task-isolation, we have
two.  Or three if you count "base" and "strict" mode task isolation as
two separate levels.

> I'm still wondering what to do if the task migrates to another CPU. In fact,
> perhaps what you're trying to do is a CPU property rather than a
> process property?

Well, we did go around on this issue once already (last August) and at
the time you were encouraging isolation to be a "task" property, not a
"cpu" property:

https://lkml.kernel.org/r/20150812160020.GG21542@lerouge

You convinced me at the time :-)

You're right that migration conflicts with task isolation.  But
certainly, if a task has enabled "strict" semantics, it can't migrate;
it will lose task isolation entirely and get a signal instead,
regardless of whether it calls sched_setaffinity() on itself, or if
someone else changes its affinity and it gets a kick.

However, if a task doesn't have strict mode enabled, it can call
sched_setaffinity() and force itself onto a non-task_isolation cpu and
it won't get any isolation until it schedules itself back onto a
task_isolation cpu, at which point it wakes up on the new cpu with
hard isolation still in effect.  I can make up reasons why this sort
of thing might be useful, but it's probably a corner case.

However, this makes me wonder if "strict" mode should be the default
for task isolation??  That way task isolation really doesn't conflict
semantically with migration.  And we could provide a "weak" mode, or a
"kernel-friendly" mode, or some such nomenclature, and define the
migration semantics just for that case, where it makes it clear it's a
bit unusual.

> I think I heard about workloads that need such strict hard isolation.
> Workloads that really cannot afford any disturbance. They even
> use a userspace network stack. Maybe HFT?

Certainly HFT is one case.

A lot of TILE-Gx customers using task isolation (which we call
"dataplane" or "Zero-Overhead Linux") are doing high-speed network
applications with user-space networking stacks.  It can be DPDK, or it
can be another TCP/IP stack (we ship one called tStack) or it
could just be an application directly messing with the network
hardware from userspace.  These are exactly the applications that led
me into this part of kernel development in the first place.
Googling "Zero-Overhead Linux" does take you to some discussions
of customers that have used this functionality.

> > I think we can actually make both modes available to users with just
> > another flag bit, so maybe we can look at what that looks like in v11:
> > adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> > isolation at the next syscall entry, page fault, etc.  Then we can
> > think more specifically about whether we want to remove the flag or
> > not, and if we remove it, whether we want to make the code that was
> > controlled by it unconditionally true or unconditionally false
> > (i.e. remove it again).
>
> I think we shouldn't bother with strict hard isolation if we don't need
> it yet. The implementation may well be invasive. Let's wait for someone
> who really needs it.

I'm not sure what part of the patch series you're saying you don't
think we need yet.  I'd argue the whole patch series is "hard
isolation", and that the "strict" mode introduced in patch 06/13 isn't
particularly invasive.

> So your requirements are actually hard isolation but in userspace?

Yes, exactly.  Were you thinking about a kernel-level hard isolation?
That would have some similarities, I guess, but in some ways might
actually be a harder problem.

> And what happens if you get interrupted in userspace? What about page
> faults and other exceptions?

See above :-)

I hope we're converging here.  If you want to talk live or chat online
to help finish converging, perhaps that would make sense?  I'd be
happy to take notes and publish a summary of wherever we get to.

Thanks for taking the time to review this!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-04-12 18:41                         ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-12 18:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/8/2016 12:34 PM, Chris Metcalf wrote:
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual. 

I noodled around with this and decided it was a better default,
so I've made the changes and pushed it up to the branch:

     git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Now, by default when you enter task isolation mode, you are in
what I used to call "strict" mode, i.e. you can't use the kernel.

You can select a user-specified signal you want to deliver instead of
the default SIGKILL, and if you select signal 0, then you don't get
a signal at all and instead you get to keep running in task
isolation mode after making a syscall, page fault, etc.

Thus the API now looks like this in <linux/prctl.h>:

#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))

I think this better matches what people should want to do in
their applications, and also matches the expectations people
have about what it means to go into task isolation mode in the
first place.
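
In use, that comes out to something like this (a sketch; SIGUSR1 is
just an example user signal, from the usual <signal.h>):

    /* Default: hard isolation, SIGKILL on any kernel entry. */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);

    /* Deliver SIGUSR1 instead of SIGKILL on a violation: */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
          PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

    /* Signal 0: no signal at all; keep isolating across syscalls,
     * page faults, etc.: */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
          PR_TASK_ISOLATION_NOSIG);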

I got rid of the ONESHOT mode that I added in the v12 series, since
it didn't seem like it was what Frederic had been asking for anyway,
and it didn't seem particularly useful on its own.

Frederic, how does this align with your intuition for this stuff?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
  (?)
  (?)
@ 2016-04-22 13:16                       ` Frederic Weisbecker
  2016-04-25 20:36                           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-04-22 13:16 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers?
> 
> So here I'm taking "interrupt" to mean an external, asynchronous
> interrupt, from another core or device, or asynchronously triggered
> on the local core, like a timer interrupt.  By contrast I use "exception"
> or "fault" to refer to synchronous, locally-triggered interruptions.

Ok.

> So for interrupts, the short answer is, it's a bug! :-)
> 
> An interrupt could be a kernel bug, in which case we consider it a
> "true" bug.  This could be a timer interrupt occurring even after the
> task isolation code thought there were none pending, or a hardware
> device that incorrectly distributes interrupts to a task-isolation
> cpu, or a global IPI that should be sent to fewer cores, or a kernel
> TLB flush that could be deferred until the task-isolation task
> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
> bug.  I'm sure there are more such bugs that we can continue to fix
> going forward; it depends on how arbitrary you want to allow code
> running on other cores to be.  For example, can another core unload a
> kernel module without interrupting a task-isolation task?  Not right now.
> 
> Or, it could be an application bug: the standard example is if you
> have an application with task-isolated cores that also does occasional
> unmaps on another thread in the same process, on another core.  This
> causes TLB flush interrupts under application control.  The
> application shouldn't do this, and we tell our customers not to build
> their applications this way.  The typical way we encourage our
> customers to arrange this kind of "multi-threading" is by having a
> pure memory API between the task isolation threads and what are
> typically "control" threads running on non-task-isolated cores.  The
> two types of threads just both mmap some common, shared memory but run
> as different processes.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

So if we take an interrupt that we didn't expect, we want to wait some more
at the end of that interrupt for things to quiesce some more?

That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce more
in case of interruptions; the tick can't be forced off anyway.

> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

That sounds sensible.

> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Ok.

> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example.  But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.  I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.

Ok, that looks right.

> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)

Ok.

> 
> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.

Yeah indeed, tasks should be allowed to perform syscalls. So we can assume
that interrupts are fine when they fire in kernel mode.

> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().

That would be extreme strict mode, yeah. We can still add such a mode later
if any user requests it.

Thanks.

(I'll reply to the rest of the email soonish.)

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-22 13:16                       ` Frederic Weisbecker
@ 2016-04-25 20:36                           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-25 20:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/22/2016 9:16 AM, Frederic Weisbecker wrote:
> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers?
>> So here I'm taking "interrupt" to mean an external, asynchronous
>> interrupt, from another core or device, or asynchronously triggered
>> on the local core, like a timer interrupt.  By contrast I use "exception"
>> or "fault" to refer to synchronous, locally-triggered interruptions.
> Ok.
>
>> So for interrupts, the short answer is, it's a bug! :-)
>>
>> An interrupt could be a kernel bug, in which case we consider it a
>> "true" bug.  This could be a timer interrupt occurring even after the
>> task isolation code thought there were none pending, or a hardware
>> device that incorrectly distributes interrupts to a task-isolation
>> cpu, or a global IPI that should be sent to fewer cores, or a kernel
>> TLB flush that could be deferred until the task-isolation task
>> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
>> bug.  I'm sure there are more such bugs that we can continue to fix
>> going forward; it depends on how arbitrary you want to allow code
>> running on other cores to be.  For example, can another core unload a
>> kernel module without interrupting a task-isolation task?  Not right now.
>>
>> Or, it could be an application bug: the standard example is if you
>> have an application with task-isolated cores that also does occasional
>> unmaps on another thread in the same process, on another core.  This
>> causes TLB flush interrupts under application control.  The
>> application shouldn't do this, and we tell our customers not to build
>> their applications this way.  The typical way we encourage our
>> customers to arrange this kind of "multi-threading" is by having a
>> pure memory API between the task isolation threads and what are
>> typically "control" threads running on non-task-isolated cores.  The
>> two types of threads just both mmap some common, shared memory but run
>> as different processes.
>>
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> So if we take an interrupt that we didn't expect, we want to wait some more
> at the end of that interrupt for things to quiesce some more?

I think it's actually pretty plausible.

Consider the "application bug" case, where you're running some code that does
packet dispatch to different cores.  If a core seems to back up, you stop
dispatching packets to it.

Now, we get a TLB flush.  If handling the flush causes us to restart the tick
(maybe just as a side effect of entering the kernel in the first place), we
really are better off staying in the kernel until the tick is handled and
things are quiesced again.  That way, although we may end up dropping a
bunch of packets that were queued up to that core, we only do so ONCE - we
don't do it again when the tick fires a little bit later on, when the core
has already caught up and is claiming to be able to handle packets again.

Also, pragmatically, we would require a whole bunch of machinery in the
kernel to figure out whether we were returning from a syscall, an exception,
or an interrupt, and only skip the task-isolation work for interrupts.  We
don't actually have that information available to us at the moment we are
returning to userspace right now, so we'd need to add that tracking state
in each platform's code somehow.


> That doesn't look right. Things should be quiesced once and for all on
> return from the initial prctl() call. We can't even expect to quiesce more
> in case of interruptions; the tick can't be forced off anyway.

Yes, things are quiesced once and for all after prctl().  We also need to
be prepared to handle unexpected interrupts, though.  It's true that we can't
force the tick off, but as I suggested above, just waiting for the tick may
well be a better strategy than subjecting the application to another interrupt
after some fraction of a second.

>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> That would be extreme strict mode, yeah. We can still add such a mode later
> if any user requests it.

So, humorously, I have become totally convinced that "extreme strict mode"
is really the right default for isolation.  It gives semantics that are easily
understandable: you stay in userspace until you do a prctl() to turn off
the flag, or exit(), or else the kernel kills you.  And, it's probably what
people want by default anyway for userspace driver code.  For code that
legitimately wants to make syscalls in this mode, you can just prctl() the
mode off, do whatever you need to do, then prctl() the mode back on again.
It's nominally a bit of overhead, but as a task-isolated application you
should be expecting tons of overhead from going into the kernel anyway.
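
I.e., the pattern is just this (a sketch; log_fd/buf/len are
illustrative, and I'm assuming a zero flags argument clears the mode):

    prctl(PR_SET_TASK_ISOLATION, 0);           /* leave isolation */
    write(log_fd, buf, len);                   /* any syscalls you like */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);  /* back on */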

The "less extreme strict mode" is arguably reasonable if you want to allow
people to make occasional syscalls, but it has confusing performance
characteristics (sometimes the syscalls happen quickly, but sometimes they
take multiple ticks while we wait for interrupts to quiesce), and it has
confusing semantics (what happens if a third party re-affinitizes you to
a non-isolated core).  So I like the idea of just having a separate flag
(PR_TASK_ISOLATION_NOSIG) that tells the kernel to let the user play in
the kernel without getting killed.

> (I'll reply to the rest of the email soonish.)

Thanks for the feedback.  It makes me feel like we may get there eventually :-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
                                         ` (2 preceding siblings ...)
  (?)
@ 2016-05-26  1:07                       ` Frederic Weisbecker
  2016-06-03 19:32                           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-05-26  1:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

I don't remember how much of this email I answered, but I need to finish that :-)

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

Good, although that quiescing on kernel return must be an option.

> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

Good.

> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Right, I suggest we use trace events btw.

> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example. But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.

Yes.

> I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.

They are not all deterministic. For example, a breakpoint, a step, or a trap
can be set up by another process. So this is not entirely under the control
of the user.

> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)

Now how do you determine which exception is a bug and which is expected?
Strict mode should refuse all of them.

> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.
> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().

Well, hard isolation is what I would call strict mode.

> 
> >Surely we can have different levels of isolation.
> 
> Well, we have nohz_full now, and by adding task-isolation, we have
> two.  Or three if you count "base" and "strict" mode task isolation as
> two separate levels.

Right.

> 
> >I'm still wondering what to do if the task migrates to another CPU. In fact,
> >perhaps what you're trying to do is rather a CPU property than a
> >process property?
> 
> Well, we did go around on this issue once already (last August) and at
> the time you were encouraging isolation to be a "task" property, not a
> "cpu" property:
> 
> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> 
> You convinced me at the time :-)

Indeed :-) Well if it's a task property, we need to handle its affinity properly then.

> 
> You're right that migration conflicts with task isolation.  But
> certainly, if a task has enabled "strict" semantics, it can't migrate;
> it will lose task isolation entirely and get a signal instead,
> regardless of whether it calls sched_setaffinity() on itself, or if
> someone else changes its affinity and it gets a kick.

Yes.

> 
> However, if a task doesn't have strict mode enabled, it can call
> sched_setaffinity() and force itself onto a non-task_isolation cpu and
> it won't get any isolation until it schedules itself back onto a
> task_isolation cpu, at which point it wakes up on the new cpu with
> hard isolation still in effect.  I can make up reasons why this sort
> of thing might be useful, but it's probably a corner case.

That doesn't look sane. The user asks the kernel to get out of the way as much
as it can, but if we are on a non-nohz-full CPU we know we can't provide that
service (or rather that non-service).

So we would refuse to enter task isolation mode if the task doesn't run on a
full dynticks CPU, whereas we accept that it migrates later to a periodic
CPU? This isn't consistent.



> 
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual.

Well we can't really implement that strict mode until we fix the 1Hz issue, right?
Besides, is this something that anyone needs now?

> 
> >I think I heard about workloads that need such strict hard isolation.
> >Workloads that really can not afford any disturbance. They even
> >use userspace network stack. Maybe HFT?
> 
> Certainly HFT is one case.
> 
> A lot of TILE-Gx customers using task isolation (which we call
> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
> applications with user-space networking stacks.  It can be DPDK, or it
> can be another TCP/IP stack (we ship one called tStack) or it
> could just be an application directly messing with the network
> hardware from userspace.  These are exactly the applications that led
> me into this part of kernel development in the first place.
> Googling "Zero-Overhead Linux" does take you to some discussions
> of customers that have used this functionality.

So those workloads couldn't stand an interrupt? Like they would like a signal
and exit the strict mode if it happens?

I think that we need to wait for somebody who explicitly requests that feature
before we work on it, so we can be sure the semantics really agree with someone's
real load case.

> 
> >> I think we can actually make both modes available to users with just
> >> another flag bit, so maybe we can look at what that looks like in v11:
> >> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> >> isolation at the next syscall entry, page fault, etc.  Then we can
> >> think more specifically about whether we want to remove the flag or
> >> not, and if we remove it, whether we want to make the code that was
> >> controlled by it unconditionally true or unconditionally false
> >> (i.e. remove it again).
> >
> >I think we shouldn't bother with strict hard isolation if we don't need
> >it yet. The implementation may well be invasive. Let's wait for someone
> >who really needs it.
> 
> I'm not sure what part of the patch series you're saying you don't
> think we need yet.  I'd argue the whole patch series is "hard
> isolation", and that the "strict" mode introduced in patch 06/13 isn't
> particularly invasive.

It's not in the patch series, I'm talking about the strict mode :-)

> 
> >So your requirements are actually hard isolation but in userspace?
> 
> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
> That would have some similarities, I guess, but in some ways might
> actually be a harder problem.
> 
> >And what happens if you get interrupted in userspace? What about page
> >faults and other exceptions?
> 
> See above :-)
> 
> I hope we're converging here.  If you want to talk live or chat online
> to help finish converging, perhaps that would make sense?  I'd be
> happy to take notes and publish a summary of wherever we get to.
> 
> Thanks for taking the time to review this!

Ok, so thinking about that talk, I'm wondering if we need some flags
such as:

         ISOLATION_SIGNAL_SYSCALL
         ISOLATION_SIGNAL_EXCEPTIONS
         ISOLATION_SIGNAL_INTERRUPTS

Strict mode would be the three above OR'ed. These are just some random thoughts,
but they would help define which level of kernel intrusion the user is ready
to tolerate.

I'm just not sure how granular we want that interface to be.
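
Expressed as C flag bits, just to make the idea concrete (the names and
values here are purely illustrative, nothing from the actual series):

#define ISOLATION_SIGNAL_SYSCALL	0x1
#define ISOLATION_SIGNAL_EXCEPTIONS	0x2
#define ISOLATION_SIGNAL_INTERRUPTS	0x4

/* Strict mode would simply be all three OR'ed together. */
#define ISOLATION_STRICT	(ISOLATION_SIGNAL_SYSCALL | \
				 ISOLATION_SIGNAL_EXCEPTIONS | \
				 ISOLATION_SIGNAL_INTERRUPTS)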

> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-05-26  1:07                       ` Frederic Weisbecker
@ 2016-06-03 19:32                           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-06-03 19:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> I don't remember how much of this email I answered, but I need to finish that :-)

Sorry for the slow response - it's been a busy week.

> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers.
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> Good, although that quiescing on kernel return must be an option.

Can you spell out why you think turning it off is helpful?  I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful.  Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces.  If you're asking for
task isolation, this is surely not what you want.

>> If you enable "strict" mode, we disable task isolation mode for that
>> core and deliver a signal to it.  This lets the application know that
>> an interrupt occurred, and it can take whatever kind of logging or
>> debugging action it wants to, re-enable task isolation if it wants to
>> and continue, or just exit or abort, etc.
> Good.
>
>> If you don't enable "strict" mode, but you do have
>> task_isolation_debug enabled as a boot flag, you will at least get a
>> console dump with a backtrace and whatever other data we have.
>> (Sometimes the debug info actually includes a backtrace of the
>> interrupting core, if it's an IPI or TLB flush from another core,
>> which can be pretty useful.)
> Right, I suggest we use trace events btw.

This is probably a good idea, although I wonder if it's worth deferring
until after the main patch series goes in - I'm reluctant to expand the scope
of this patch series and add more reasons for it to get delayed :-)
What do you think?

>>>> "One-shot mode": A task requests isolation via prctl(), the kernel
>>>> ensures it is isolated on return from the prctl(), but then as soon as
>>>> it enters the kernel again, task isolation is switched off until
>>>> another prctl is issued.  This is what you recommended in your last
>>>> email.
>>> No, I think we can issue syscalls, for example. But asynchronous interruptions
>>> such as exceptions (actually somewhat synchronous but can be unexpected) and
>>> interrupts are what we want to avoid.
>> Hmm, so I think I'm not really understanding what you are suggesting.
>>
>> We're certainly in agreement that avoiding interrupts and exceptions
>> is important.  I'm arguing that the way to deal with them is to
>> generate appropriate signals/printks, etc.
> Yes.
>
>> I'm not actually sure what
>> you're recommending we do to avoid exceptions.  Since they're
>> synchronous and deterministic, we can't really avoid them if the
>> program wants to issue them.  For example, mmap() some anonymous
>> memory and then start running, and you'll take exceptions each time
>> you touch a page in that mapped region.  I'd argue it's an application
>> bug; one should enable "strict" mode to catch and deal with such bugs.
> They are not all deterministic. For example, a breakpoint, a step, or a trap
> can be set up by another process. So this is not entirely under the control
> of the user.

That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.  There are two ways you could handle debugging:

1. Require the program to set the flag that says it doesn't want a signal
when it is interrupted (so you can interrupt it to debug it, and not kill it);

2. Or have debugging automatically set that flag in the target process.
Similarly, we could just say that if a debugger is attached, we never
generate the kill signal for task isolation.

>> (Typically the recommendation is to do an mlockall() before starting
>> task isolation mode, to handle the case of page faults.  But you can
>> do that and still be screwed by another thread in your process doing a
>> fork() and then your pages end up read-only for COW and you have to
>> fault them back in.  But, that's an application bug for a
>> task-isolation thread, and should just be treated as such.)
> Now how do you determine which exception is a bug and which is expected?
> Strict mode should refuse all of them.

Yes, exactly.  Task isolation will complain about everything. :-)

>>>> There are a number of pros and cons to the two models.  I think on
>>>> balance I still like the "persistent mode" approach, but here's all
>>>> the pros/cons I can think of:
>>>>
>>>> PRO for persistent mode: A somewhat easier programming model.  Users
>>>> can just imagine "task isolation" as a way for them to still be able
>>>> to use the kernel exactly as they always have; it's just slower to get
>>>> back out of the kernel so you use it judiciously. For example, a
>>>> process is free to call write() on a socket to perform a diagnostic,
>>>> but when returning from the write() syscall, the kernel will hold the
>>>> task in kernel mode until any timer ticks (perhaps from networking
>>>> stuff) are complete, and then let it return to userspace to continue
>>>> in task isolation mode.
>>> So this is not hard isolation anymore. This is rather soft isolation with
>>> best efforts to avoid disturbance.
>> No, it's still hard isolation.  The distinction is that we offer a way
>> to get in and out of the kernel "safely" if you want to run in that
>> mode.  The syscalls can take a long time if the syscall ends up
>> requiring some additional timer ticks to finish sorting out whatever
>> it was you asked the kernel to do, but once you're back in userspace
>> you immediately regain "hard" isolation.  It's under program control.
>>
>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> Well, hard isolation is what I would call strict mode.

Here's what I am inclined towards:

  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.

  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
    return, no signals.  But asynchronous interrupts still cause a signal since they are
    not expected to occur.

  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
    on return to userspace, and asynchronous interrupts don't even cause a signal.
    It's basically "best effort", just nohz_full plus the code that tries to get things
    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
    "value add" to make this a separate mode, though.

>>> Surely we can have different levels of isolation.
>> Well, we have nohz_full now, and by adding task-isolation, we have
>> two.  Or three if you count "base" and "strict" mode task isolation as
>> two separate levels.
> Right.
>
>>> I'm still wondering what to do if the task migrates to another CPU. In fact,
>>> perhaps what you're trying to do is rather a CPU property than a
>>> process property?
>> Well, we did go around on this issue once already (last August) and at
>> the time you were encouraging isolation to be a "task" property, not a
>> "cpu" property:
>>
>> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
>>
>> You convinced me at the time :-)
> Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
>> You're right that migration conflicts with task isolation.  But
>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>> it will lose task isolation entirely and get a signal instead,
>> regardless of whether it calls sched_setaffinity() on itself, or if
>> someone else changes its affinity and it gets a kick.
> Yes.
>
>> However, if a task doesn't have strict mode enabled, it can call
>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>> it won't get any isolation until it schedules itself back onto a
>> task_isolation cpu, at which point it wakes up on the new cpu with
>> hard isolation still in effect.  I can make up reasons why this sort
>> of thing might be useful, but it's probably a corner case.
> That doesn't look sane. The user asks the kernel to get out of the way as much
> as it can, but if we are on a non-nohz-full CPU we know we can't provide that
> service (or rather that non-service).
>
> So we would refuse to enter task isolation mode if the task doesn't run on a
> full dynticks CPU, whereas we accept that it migrates later to a periodic
> CPU? This isn't consistent.

Yes, and originally I made that consistent by not checking when it started
up, either, but I was subsequently convinced that the checks were good for
sanity.

Another answer is just to say that the full strict mode is the only mode, and
that if the task leaves userspace, it leaves task isolation mode until the mode
is re-enabled.  In the context of receiving a signal each time, this is more plausible.
You can always re-enable task isolation in the signal handler if you want.
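
For instance, the handler might look roughly like this (a sketch only:
it assumes the isolation-violation signal has been made catchable, say
SIGUSR1 instead of a default fatal signal, which is a detail the series
would still have to provide):

#include <signal.h>
#include <sys/prctl.h>

/* Handler for a (hypothetically catchable) isolation-violation signal. */
static void isolation_lost(int sig)
{
	/* ... log or count the interruption here ... */

	/* Opt back in.  prctl() is a thin syscall wrapper, so calling it
	 * from a handler should be safe, and it returns only once the
	 * core is quiesced again. */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);
}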

I still suspect that the "hybrid" mode where you can leave userspace for things
like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
about task migration.  We can refuse to honor a task's request to migrate itself
in that case, perhaps.  I don't know what to think about when someone else tries
to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
fails, when the task is in task isolation mode?  It gets tricky and that's why I
was inclined to go with a simple "it always works, but it produces results
that you have to read the documentation to understand" (i.e. task isolation
mode goes dormant until you schedule back to a task isolation cpu).
On balance this is still the approach that I like best.

Which approach seems best to you?

>> However, this makes me wonder if "strict" mode should be the default
>> for task isolation??  That way task isolation really doesn't conflict
>> semantically with migration.  And we could provide a "weak" mode, or a
>> "kernel-friendly" mode, or some such nomenclature, and define the
>> migration semantics just for that case, where it makes it clear it's a
>> bit unusual.
> Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> Besides, is this something that anyone needs now?

Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
either by commenting out the max_deferment call, or at such time as we have
really fixed the underlying issues and removed the max deferment entirely.

At that point, I'm not sure it's a question of people needing strict mode per se;
I think it's more about picking the mode that is the best from both a user experience
and a quality of implementation perspective.

>>> I think I heard about workloads that need such strict hard isolation.
>>> Workloads that really can not afford any disturbance. They even
>>> use userspace network stack. Maybe HFT?
>> Certainly HFT is one case.
>>
>> A lot of TILE-Gx customers using task isolation (which we call
>> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
>> applications with user-space networking stacks.  It can be DPDK, or it
>> can be another TCP/IP stack (we ship one called tStack) or it
>> could just be an application directly messing with the network
>> hardware from userspace.  These are exactly the applications that led
>> me into this part of kernel development in the first place.
>> Googling "Zero-Overhead Linux" does take you to some discussions
>> of customers that have used this functionality.
> So those workloads couldn't stand an interrupt? Like they would like a signal
> and exit the strict mode if it happens?

Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
be dropped and some kind of logging would fire to report the problem.

> I think that we need to wait for somebody who explicitly requests that feature
> before we work on it, so we can be sure the semantics really agree with someone's
> real load case.

This is really the scenario that Tilera's customers use, so I'm pretty familiar with
what they expect.

>>>> I think we can actually make both modes available to users with just
>>>> another flag bit, so maybe we can look at what that looks like in v11:
>>>> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
>>>> isolation at the next syscall entry, page fault, etc.  Then we can
>>>> think more specifically about whether we want to remove the flag or
>>>> not, and if we remove it, whether we want to make the code that was
>>>> controlled by it unconditionally true or unconditionally false
>>>> (i.e. remove it again).
>>> I think we shouldn't bother with strict hard isolation if we don't need
>>> it yet. The implementation may well be invasive. Let's wait for someone
>>> who really needs it.
>> I'm not sure what part of the patch series you're saying you don't
>> think we need yet.  I'd argue the whole patch series is "hard
>> isolation", and that the "strict" mode introduced in patch 06/13 isn't
>> particularly invasive.
> It's not in the patch series, I'm talking about the strict mode :-)
>
>>> So your requirements are actually hard isolation but in userspace?
>> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
>> That would have some similarities, I guess, but in some ways might
>> actually be a harder problem.
>>
>>> And what happens if you get interrupted in userspace? What about page
>>> faults and other exceptions?
>> See above :-)
>>
>> I hope we're converging here.  If you want to talk live or chat online
>> to help finish converging, perhaps that would make sense?  I'd be
>> happy to take notes and publish a summary of wherever we get to.
>>
>> Thanks for taking the time to review this!
> Ok, so thinking about that talk, I'm wondering if we need some flags
> such as:
>
>           ISOLATION_SIGNAL_SYSCALL
>           ISOLATION_SIGNAL_EXCEPTIONS
>           ISOLATION_SIGNAL_INTERRUPTS
>
> Strict mode would be the three above OR'ed. These are just some random thoughts,
> but they would help define which level of kernel intrusion the user is ready
> to tolerate.
>
> I'm just not sure how granular we want that interface to be.

Yes, you could certainly imagine being more granular.  For example, if you expected
to make syscalls but not receive exceptions or interrupts, that might be a useful
mode.  Or, you were willing to make syscalls and take exceptions, but not receive
interrupts.  (I think you should never be willing to receive asynchronous interrupts,
since that kind of defeats the purpose of task isolation in the first place.)

So maybe something like this:

PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return

It might make sense to say you would allow page faults, for example, but not general
exceptions.  But my guess is that the exception-related stuff really does need an
application use case to account for it.  I would say for the initial support of task
isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
like generating diagnostics on error or slow paths), but not really a model for
understanding why users would want to take exceptions, so I'd say let's omit
that initially, and maybe just add the _ALLOW_SYSCALLS flag.
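
Putting that together, application startup might look roughly like this
(a sketch only; the flag names are the proposals above, and the
mlockall() is the usual recommendation to avoid page faults later):

#include <sys/mman.h>
#include <sys/prctl.h>

static int start_isolation(int allow_syscalls)
{
	int flags = PR_TASK_ISOLATION_ENABLE;

	/* Lock memory down first so page faults don't violate
	 * isolation afterward. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -1;

	if (allow_syscalls)
		flags |= PR_TASK_ISOLATION_ALLOW_SYSCALLS;

	/* Returns only once the core is quiesced. */
	return prctl(PR_SET_TASK_ISOLATION, flags);
}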

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-03 19:32                           ` Chris Metcalf
  (?)
@ 2016-06-29 15:18                           ` Frederic Weisbecker
  2016-07-01 20:59                               ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-06-29 15:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> >I don't remember how much of this email I answered, but I need to finish that :-)
> 
> Sorry for the slow response - it's been a busy week.

I'm certainly much slower ;-)

> 
> >On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> >>On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >>>On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>>>   TL;DR: Let's make an explicit decision about whether task isolation
> >>>>   should be "persistent" or "one-shot".  Both have some advantages.
> >>>>   =====
> >>>>
> >>>>An important high-level issue is how "sticky" task isolation mode is.
> >>>>We need to choose one of these two options:
> >>>>
> >>>>"Persistent mode": A task switches state to "task isolation" mode
> >>>>(kind of a level-triggered analogy) and stays there indefinitely.  It
> >>>>can make a syscall, take a page fault, etc., if it wants to, but the
> >>>>kernel protects it from incurring any further asynchronous interrupts.
> >>>>This is the model I've been advocating for.
> >>>But then in this mode, what happens when an interrupt triggers.
> >>So what happens if an interrupt does occur?
> >>
> >>In the "base" task isolation mode, you just take the interrupt, then
> >>wait to quiesce any further kernel timer ticks, etc., and return to
> >>the process.  This at least limits the damage to being a single
> >>interruption rather than potentially additional ones, if the interrupt
> >>also caused timers to get queued, etc.
> >Good, although that quiescing on kernel return must be an option.
> 
> Can you spell out why you think turning it off is helpful?  I'll admit
> this is the default mode in the commercial version of task isolation
> that we ship, and was also the default in the first LKML patch series.
> But on consideration I haven't found scenarios where skipping the
> quiescing is helpful.  Admittedly you get out of the kernel faster,
> but then you're back in userspace and vulnerable to yet more
> unexpected interrupts until the timer quiesces.  If you're asking for
> task isolation, this is surely not what you want.

I just feel that quiescing, on the way back to user after an unwanted
interruption, is awkward. The quiescing should work once and for all
on return from the prctl. If we still get disturbed afterward, either
the quiescing is buggy or incomplete, or something is on the way that
cannot be quiesced.

> 
> >>If you enable "strict" mode, we disable task isolation mode for that
> >>core and deliver a signal to it.  This lets the application know that
> >>an interrupt occurred, and it can take whatever kind of logging or
> >>debugging action it wants to, re-enable task isolation if it wants to
> >>and continue, or just exit or abort, etc.
> >Good.
> >
> >>If you don't enable "strict" mode, but you do have
> >>task_isolation_debug enabled as a boot flag, you will at least get a
> >>console dump with a backtrace and whatever other data we have.
> >>(Sometimes the debug info actually includes a backtrace of the
> >>interrupting core, if it's an IPI or TLB flush from another core,
> >>which can be pretty useful.)
> >Right, I suggest we use trace events btw.
> 
> This is probably a good idea, although I wonder if it's worth deferring
> until after the main patch series goes in - I'm reluctant to expand the scope
> of this patch series and add more reasons for it to get delayed :-)
> What do you think?

Yeah, definitely, the patchset is big enough :-)

> 
> >>>>"One-shot mode": A task requests isolation via prctl(), the kernel
> >>>>ensures it is isolated on return from the prctl(), but then as soon as
> >>>>it enters the kernel again, task isolation is switched off until
> >>>>another prctl is issued.  This is what you recommended in your last
> >>>>email.
> >>>No, I think we can issue syscalls, for example. But asynchronous interruptions
> >>>such as exceptions (actually somewhat synchronous but can be unexpected) and
> >>>interrupts are what we want to avoid.
> >>Hmm, so I think I'm not really understanding what you are suggesting.
> >>
> >>We're certainly in agreement that avoiding interrupts and exceptions
> >>is important.  I'm arguing that the way to deal with them is to
> >>generate appropriate signals/printks, etc.
> >Yes.
> >
> >>I'm not actually sure what
> >>you're recommending we do to avoid exceptions.  Since they're
> >>synchronous and deterministic, we can't really avoid them if the
> >>program wants to issue them.  For example, mmap() some anonymous
> >>memory and then start running, and you'll take exceptions each time
> >>you touch a page in that mapped region.  I'd argue it's an application
> >>bug; one should enable "strict" mode to catch and deal with such bugs.
> >They are not all deterministic. For example a breakpoint, a step, a trap
> >can be set up by another process. So this is not entirely under the control
> >of the user.
> 
> That's true, but I'd argue the behavior in that case should be that you can
> raise that kind of exception validly (so you can debug), and then you should
> quiesce on return to userspace so the application doesn't see additional
> exceptions.

I don't see how we can quiesce such things.

> There are two ways you could handle debugging:
> 
> 1. Require the program to set the flag that says it doesn't want a signal
> when it is interrupted (so you can interrupt it to debug it, and not kill it);

That's rather about exceptions, right?

> 
> 2. Or have debugging automatically set that flag in the target process.
> Similarly, we could just say that if a debugger is attached, we never
> generate the kill signal for task isolation.
> 
> >>(Typically the recommendation is to do an mlockall() before starting
> >>task isolation mode, to handle the case of page faults.  But you can
> >>do that and still be screwed by another thread in your process doing a
> >>fork() and then your pages end up read-only for COW and you have to
> >>fault them back in.  But, that's an application bug for a
> >>task-isolation thread, and should just be treated as such.)
> >Now how do you determine which exception is a bug and which is expected?
> >Strict mode should refuse all of them.
> 
> Yes, exactly.  Task isolation will complain about everything. :-)

Ok :-)

> 
> >>>>There are a number of pros and cons to the two models.  I think on
> >>>>balance I still like the "persistent mode" approach, but here's all
> >>>>the pros/cons I can think of:
> >>>>
> >>>>PRO for persistent mode: A somewhat easier programming model.  Users
> >>>>can just imagine "task isolation" as a way for them to still be able
> >>>>to use the kernel exactly as they always have; it's just slower to get
> >>>>back out of the kernel so you use it judiciously. For example, a
> >>>>process is free to call write() on a socket to perform a diagnostic,
> >>>>but when returning from the write() syscall, the kernel will hold the
> >>>>task in kernel mode until any timer ticks (perhaps from networking
> >>>>stuff) are complete, and then let it return to userspace to continue
> >>>>in task isolation mode.
> >>>So this is not hard isolation anymore. This is rather soft isolation with
> >>>best efforts to avoid disturbance.
> >>No, it's still hard isolation.  The distinction is that we offer a way
> >>to get in and out of the kernel "safely" if you want to run in that
> >>mode.  The syscalls can take a long time if the syscall ends up
> >>requiring some additional timer ticks to finish sorting out whatever
> >>it was you asked the kernel to do, but once you're back in userspace
> >>you immediately regain "hard" isolation.  It's under program control.
> >>
> >>Or, you can enable "strict" mode, and then you get hard isolation
> >>without the ability to get in and out of the kernel at all: the kernel
> >>just kills you if you try to leave hard isolation other than by an
> >>explicit prctl().
> >Well, hard isolation is what I would call strict mode.
> 
> Here's what I am inclined towards:
> 
>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.

Ok.

> 
>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>    not expected to occur.

So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?

> 
>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>    It's basically "best effort", just nohz_full plus the code that tries to get things
>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>    "value add" to make this a separate mode, though.

I can imagine HPC workloads wanting this mode.

> 
> >>>Surely we can have different levels of isolation.
> >>Well, we have nohz_full now, and by adding task-isolation, we have
> >>two.  Or three if you count "base" and "strict" mode task isolation as
> >>two separate levels.
> >Right.
> >
> >>>I'm still wondering what to do if the task migrates to another CPU. In fact,
> >>>perhaps what you're trying to do is rather a CPU property than a
> >>>process property?
> >>Well, we did go around on this issue once already (last August) and at
> >>the time you were encouraging isolation to be a "task" property, not a
> >>"cpu" property:
> >>
> >>https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> >>
> >>You convinced me at the time :-)
> >Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
> >>You're right that migration conflicts with task isolation.  But
> >>certainly, if a task has enabled "strict" semantics, it can't migrate;
> >>it will lose task isolation entirely and get a signal instead,
> >>regardless of whether it calls sched_setaffinity() on itself, or if
> >>someone else changes its affinity and it gets a kick.
> >Yes.
> >
> >>However, if a task doesn't have strict mode enabled, it can call
> >>sched_setaffinity() and force itself onto a non-task_isolation cpu and
> >>it won't get any isolation until it schedules itself back onto a
> >>task_isolation cpu, at which point it wakes up on the new cpu with
> >>hard isolation still in effect.  I can make up reasons why this sort
> >>of thing might be useful, but it's probably a corner case.
> >That doesn't look sane. The user asks the kernel to get away as much
> >as it can but if we are in a non-nohz-full CPU we know we can't provide that
> >service (or rather that non-service).
> >
> >So we would refuse to enter task isolation mode if the task doesn't run on a
> >full dynticks CPU, whereas we accept that it migrates later to a periodic
> >CPU? This isn't consistent.
> 
> Yes, and originally I made that consistent by not checking when it started
> up, either, but I was subsequently convinced that the checks were good for
> sanity.

Sure, sanity checks are good, but if you refuse the prctl by returning an error
on the basis of this sanity condition, the task shouldn't be able to later reach
that insane state without being properly kicked out of the feature provided by
the prctl().

Otherwise perhaps just drop a warning.

> 
> Another answer is just to say that the full strict mode is the only mode, and
> that if the task leaves userspace, it leaves task isolation mode until the mode
> is re-enabled.  In the context of receiving a signal each time, this is more plausible.
> You can always re-enable task isolation in the signal handler if you want.

I would be afraid that, on workloads that can live with a few interrupts, those signals
would be a burden.

> 
> I still suspect that the "hybrid" mode where you can leave userspace for things
> like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
> about task migration.  We can refuse to honor a task's request to migrate itself
> in that case, perhaps.  I don't know what to think about when someone else tries
> to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
> fails, when the task is in task isolation mode?  It gets tricky and that's why I
> was inclined to go with a simple "it always works, but it produces results
> that you have to read the documentation to understand" (i.e. task isolation
> mode goes dormant until you schedule back to a task isolation cpu).
> On balance this is still the approach that I like best.
> 
> Which approach seems best to you?

Indeed, forbidding the task to run on a non-nohz-full CPU would be very tricky.
We would need to take care of all possible races, which has to be done under
the rq lock, so it requires complicating scheduler internals. And eventually,
if the CPU gets offlined, we still need to find the task a place to run.
Moreover, this raises some privilege issues.

That's not really an option, so this leaves two others:

* Make sure that as soon as the task gets scheduled onto a non-nohz-full CPU, it
  loses the flag and gets a signal. That's possible, but again it requires touching
  some scheduler internals.

* Just don't care and schedule the task anywhere; it will be warned soon enough
  about the problem.

The last one looks like a viable and simple enough solution.
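
In sketch form, that last option could just be a check in the scheduler
tick, something like this (the task_isolation_*() helpers are invented
names, only to show the shape of it; tick_nohz_full_cpu() and send_sig()
are the real kernel APIs):

/* Sketch: called from the scheduler tick on each cpu. */
static void task_isolation_check_tick(struct task_struct *p)
{
        if (!task_isolation_enabled(p))
                return;

        if (!tick_nohz_full_cpu(smp_processor_id())) {
                /* Wrong cpu: kick the task out of isolation mode. */
                task_isolation_disable(p);
                send_sig(SIGKILL, p, 1); /* or the configured signal */
        }
}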

> 
> >>However, this makes me wonder if "strict" mode should be the default
> >>for task isolation??  That way task isolation really doesn't conflict
> >>semantically with migration.  And we could provide a "weak" mode, or a
> >>"kernel-friendly" mode, or some such nomenclature, and define the
> >>migration semantics just for that case, where it makes it clear it's a
> >>bit unusual.
> >Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> >Besides, is this something that anyone needs now?
> 
> Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
> either by commenting out the max_deferment call, or at such time as we have
> really fixed the underlying issues and remove the max deferment entirely.
> 
> At that point, I'm not sure it's a question of people needing strict mode per se;
> I think it's more about picking the mode that is the best from both a user experience
> and a quality of implementation perspective.

Sure, ideally we need to start with the mode that people need most and leave room
in the interface for extension.

> 
> >>>I think I heard about workloads that need such strict hard isolation.
> >>>Workloads that really can not afford any disturbance. They even
> >>>use userspace network stack. Maybe HFT?
> >>Certainly HFT is one case.
> >>
> >>A lot of TILE-Gx customers using task isolation (which we call
> >>"dataplane" or "Zero-Overhead Linux") are doing high-speed network
> >>applications with user-space networking stacks.  It can be DPDK, or it
> >>can be another TCP/IP stack (we ship one called tStack) or it
> >>could just be an application directly messing with the network
> >>hardware from userspace.  These are exactly the applications that led
> >>me into this part of kernel development in the first place.
> >>Googling "Zero-Overhead Linux" does take you to some discussions
> >>of customers that have used this functionality.
> >So those workloads couldn't stand an interrupt? Like they would like a signal
> >and exit the strict mode if it happens?
> 
> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
> be dropped and some kind of logging would fire to report the problem.

Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?

> 
> >I think that we need to wait for somebody who explicitly request that feature
> >before we work on it, so we get sure the semantics really agree with someone's
> >real load case.
> 
> This is really the scenario that Tilera's customers use, so I'm pretty familiar with
> what they expect.

Ok, so let's take that direction.

> 
> >Ok, so thinking about that talk, I'm wondering if we need some flags
> >such as:
> >
> >          ISOLATION_SIGNAL_SYSCALL
> >          ISOLATION_SIGNAL_EXCEPTIONS
> >          ISOLATION_SIGNAL_INTERRUPTS
> >
> >Strict mode would be the three above OR'ed. It's just some random thoughts
> >but that would help define which level of kernel intrusion the user is ready
> >to tolerate.
> >
> >I'm just not sure how granular we want that interface to be.
> 
> Yes, you could certainly imagine being more granular.  For example, if you expected
> to make syscalls but not receive exceptions or interrupts, that might be a useful
> mode.  Or, you were willing to make syscalls and take exceptions, but not receive
> interrupts.  (I think you should never be willing to receive asynchronous interrupts,
> since that kind of defeats the purpose of task isolation in the first place.)
> 
> So maybe something like this:
> 
> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
> 
> It might make sense to say you would allow page faults, for example, but not general
> exceptions.  But my guess is that the exception-related stuff really does need an
> application use case to account for it.  I would say for the initial support of task
> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
> like generating diagnostics on error or slow paths), but not really a model for
> understanding why users would want to take exceptions, so I'd say let's omit
> that initially, and maybe just add the _ALLOW_SYSCALLS flag.

Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
does strict pure isolation mode and have future flags for more granularity.

I guess the last thing I'm uncomfortable with is the quiescing that needs to be re-done
every time we get interrupted.

Thanks.

> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-29 15:18                           ` Frederic Weisbecker
@ 2016-07-01 20:59                               ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-07-01 20:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
>> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
>>> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>>>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>>>   TL;DR: Let's make an explicit decision about whether task isolation
>>>>>>   should be "persistent" or "one-shot".  Both have some advantages.
>>>>>>   =====
>>>>>>
>>>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>>>> We need to choose one of these two options:
>>>>>>
>>>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>>>> This is the model I've been advocating for.
>>>>> But then in this mode, what happens when an interrupt triggers.
>>>> So what happens if an interrupt does occur?
>>>>
>>>> In the "base" task isolation mode, you just take the interrupt, then
>>>> wait to quiesce any further kernel timer ticks, etc., and return to
>>>> the process.  This at least limits the damage to being a single
>>>> interruption rather than potentially additional ones, if the interrupt
>>>> also caused timers to get queued, etc.
>>> Good, although that quiescing on kernel return must be an option.
>>
>> Can you spell out why you think turning it off is helpful?  I'll admit
>> this is the default mode in the commercial version of task isolation
>> that we ship, and was also the default in the first LKML patch series.
>> But on consideration I haven't found scenarios where skipping the
>> quiescing is helpful.  Admittedly you get out of the kernel faster,
>> but then you're back in userspace and vulnerable to yet more
>> unexpected interrupts until the timer quiesces.  If you're asking for
>> task isolation, this is surely not what you want.
>
> I just feel that quiescing, on the way back to user after an unwanted
> interruption, is awkward. The quiescing should work once and for all
> on return from the prctl. If we still get disturbed afterward, either
> the quiescing is buggy or incomplete, or something is on the way that
> cannot be quiesced.

If we are thinking of an initial implementation that doesn't allow any
subsequent kernel entry to be valid, then this all gets much easier,
since any subsequent kernel entry except for a prctl() syscall will
result in a signal, which will turn off task isolation, and we will
never have to worry about additional quiescing.  I think that's where
we got to in the discussion at the bottom of this email.

So for your question here, we're really just thinking about future
directions as far as how to handle interrupts, and if in the future we
add support for allowing syscalls and/or exceptions without leaving
task isolation mode, then we have to think about how that interacts
with interrupts.  The problem is that it's hard to tell, as you're
returning to userspace, whether you're returning from an exception or
an interrupt; you typically don't have that information available.  So
from a purely ease-of-implementation perspective, we'd likely want to
handle exceptions and interrupts the same way, and quiesce both.

In general, I think it would also be a better explanation to users of
task isolation to say "every enter/exit to the kernel is either an
error that causes a signal, or it quiesces on return".  It's a simpler
semantic, and I think it also is better for interrupts anyway, since
it potentially avoids multiple interrupts to the application (whatever
interrupted to begin with, plus potential timer interrupts later).

But that said, if we start with "pure strict" mode only, all of this
becomes hypothetical, and we may in fact choose never to allow "safe"
modes of entering the kernel.

>>>> I'm not actually sure what
>>>> you're recommending we do to avoid exceptions.  Since they're
>>>> synchronous and deterministic, we can't really avoid them if the
>>>> program wants to issue them.  For example, mmap() some anonymous
>>>> memory and then start running, and you'll take exceptions each time
>>>> you touch a page in that mapped region.  I'd argue it's an application
>>>> bug; one should enable "strict" mode to catch and deal with such bugs.
>>> They are not all deterministic. For example a breakpoint, a step, a trap
>>> can be set up by another process. So this is not entirely under the control
>>> of the user.
>>
>> That's true, but I'd argue the behavior in that case should be that you can
>> raise that kind of exception validly (so you can debug), and then you should
>> quiesce on return to userspace so the application doesn't see additional
>> exceptions.
>
> I don't see how we can quiesce such things.

I'm imagining task A is in dataplane mode, and task B wants to debug
it by writing a breakpoint into its text.  When task A hits the
breakpoint, it will enter the kernel, and hold there while task B
pokes at it with ptrace.  When task A finally is allowed to return to
userspace, it should quiesce before entering userspace in case any
timer interrupts got scheduled (again, maybe due to softirqs or
whatever, or random other kernel activity targeting that core while it
was in the kernel, or whatever).  This is just the same kind of
quiescing we do on return from the initial prctl().

With a "pure strict" mode it does get a little tricky, since we will
end up killing task A as it comes back from its breakpoint.  We might
just choose to say that task A should not enable task isolation if it
is going to be debugged (some runtime switch).  This isn't really a
great solution; I do kind of feel that the nicest thing to do is
quiesce the task again at this point.  This feels like the biggest
argument in favor of supporting a mode where a task-isolated task can
safely enter the kernel for exceptions.  What do you think?

>> There are two ways you could handle debugging:
>>
>> 1. Require the program to set the flag that says it doesn't want a signal
>> when it is interrupted (so you can interrupt it to debug it, and not kill it);
>
> That's rather about exceptions, right?

Yes, with the task A/task B example above, you're right.  I was
thinking there was a kick given by task B to task A.  I think that
might even be true in some circumstances, but anyway, it's a detail.

>> Here's what I am inclined towards:
>>
>>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
>
> Ok.
>
>>
>>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>>    not expected to occur.
>
> So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?

Yes, correct.

>>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>>    It's basically "best effort", just nohz_full plus the code that tries to get things
>>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>>    "value add" to make this a separate mode, though.
>
> I can imagine HPC workloads wanting this mode.

Yes, perhaps.  I'm not convinced we want to target HPC without a much
clearer sense of why this is better than nohz_full, though.  I fear
people might think "task isolation" is better by definition and not
think too much about it, but I'm really not sure it is better for the
HPC use case, necessarily.

>>>> You're right that migration conflicts with task isolation.  But
>>>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>>>> it will lose task isolation entirely and get a signal instead,
>>>> regardless of whether it calls sched_setaffinity() on itself, or if
>>>> someone else changes its affinity and it gets a kick.
>>> Yes.
>>>
>>>> However, if a task doesn't have strict mode enabled, it can call
>>>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>>>> it won't get any isolation until it schedules itself back onto a
>>>> task_isolation cpu, at which point it wakes up on the new cpu with
>>>> hard isolation still in effect.  I can make up reasons why this sort
>>>> of thing might be useful, but it's probably a corner case.
>>> That doesn't look sane. The user asks the kernel to get away as much
>>> as it can but if we are in a non-nohz-full CPU we know we can't provide that
>>> service (or rather that non-service).
>>>
>>> So we would refuse to enter task isolation mode if the task doesn't run on a
>>> full dynticks CPU, whereas we accept that it migrates later to a periodic
>>> CPU? This isn't consistent.
>>
>> Yes, and originally I made that consistent by not checking when it started
>> up, either, but I was subsequently convinced that the checks were good for
>> sanity.
>
> Sure, sanity checks are good, but if you refuse the prctl by returning an error
> on the basis of this sanity condition, the task shouldn't be able to later reach
> that insane state without being properly kicked out of the feature provided by
> the prctl().
>
> Otherwise perhaps just drop a warning.

Are you saying that we should printk a warning in the prctl() rather
than returning an error in the case where it's not on a full dynticks
cpu?  I could be convinced by that just to keep things consistent.

How about doing it this way?  If you invoke prctl() with the default
"strict" mode where any kernel entry results in a signal, the prctl()
will be strict, and require you to be affinitized to a single, full
dynticks cpu.

But, if you enable the "allow syscalls" mode, then the prctl isn't
strict either, since you can use syscalls to get into a state where
you're not on a full dynticks cpu, and you just get a console warning
if you enter task isolation on the wrong cpu.  (Of course, we may end
up not doing the "allow syscalls" mode for the first version of this
patch anyway, as we discuss below.)
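
Concretely, the prctl-time check might look roughly like this (sketch
only: the helper name is made up, cpus_allowed is the 4.4-era field name,
and the real series may check different conditions; cpumask_weight(),
tick_nohz_full_cpu() and pr_warn() are the real APIs):

/* Sketch of the prctl-time sanity check, ignoring preemption issues. */
static int task_isolation_request(unsigned int flags)
{
        struct task_struct *p = current;
        int cpu = raw_smp_processor_id();

        if (cpumask_weight(&p->cpus_allowed) != 1 ||
            !tick_nohz_full_cpu(cpu)) {
                if (!(flags & PR_TASK_ISOLATION_ALLOW_SYSCALLS))
                        return -EINVAL;  /* strict: refuse outright */
                pr_warn("%s/%d: task isolation on non-nohz_full cpu %d\n",
                        p->comm, p->pid, cpu);
        }

        /* ... set the task flag and quiesce before returning ... */
        return 0;
}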

>>>> Googling "Zero-Overhead Linux" does take you to some discussions
>>>> of customers that have used this functionality.
>>> So those workloads couldn't stand an interrupt? Like they would like a signal
>>> and exit the strict mode if it happens?
>>
>> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
>> be dropped and some kind of logging would fire to report the problem.
>
> Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?

In this mode we don't worry about quiescing for interrupts, since we
are generating a signal, and when you send a signal, you first have to
disable task isolation mode to avoid getting into various bad states
(sending too many signals, or worse, getting deadlocked because you
are signalling the task BECAUSE it was about to receive a signal).  So
we only quiesce after syscalls/exceptions.
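
In other words, roughly this (a sketch with invented helper names;
send_sig() is the real API):

/* Take the task out of isolation before signalling it, so that the
 * signal delivery can't itself trigger yet another signal. */
static void task_isolation_interrupt(struct task_struct *p, int sig)
{
        task_isolation_disable(p);
        send_sig(sig, p, 1);
}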

>> So maybe something like this:
>>
>> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
>> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
>> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
>>
>> It might make sense to say you would allow page faults, for example, but not general
>> exceptions.  But my guess is that the exception-related stuff really does need an
>> application use case to account for it.  I would say for the initial support of task
>> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
>> like generating diagnostics on error or slow paths), but not really a model for
>> understanding why users would want to take exceptions, so I'd say let's omit
>> that initially, and maybe just add the _ALLOW_SYSCALLS flag.
>
> Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> does strict pure isolation mode and have future flags for more granularity.

I think just implementing the basic _ENABLE mode with pure strict task
isolation makes sense for now.  We can wait to enable syscalls or
exceptions until we have a better use case.  Meanwhile, even without
support for allowing syscalls, you can always use prctl() to turn off
task isolation, and then you can do your syscalls, and prctl() it back
on again.  prctl() to disable task isolation always has to work :-)
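
So the application-side pattern would just be something like this (the
prctl option name and flag value are placeholders for whatever we end up
defining):

#include <unistd.h>
#include <sys/prctl.h>

#define PR_SET_TASK_ISOLATION    48       /* placeholder */
#define PR_TASK_ISOLATION_ENABLE (1 << 0) /* placeholder */

/* Drop out of isolation, do our syscalls, then re-enter (which
 * quiesces again before returning to userspace). */
static void isolated_log(const char *msg, size_t len)
{
        prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
        write(STDERR_FILENO, msg, len);
        prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);
}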

Or, if we want to make it easy to do debugging, and as a result maybe
also support the plausible mode where task-isolation tasks make
occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
above implies syscalls as well, and support that mode.  Perhaps that
makes the most sense...

I'll spin it as a new patch series and you can take a look.

Thanks!
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-07-01 20:59                               ` Chris Metcalf
@ 2016-07-05 14:41                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return from the prctl. If we still get disturbed afterward, either
> >the quiescing is buggy or incomplete, or something is on the way that
> >cannot be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got to in the discussion at the bottom of this email.

Right.

> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.

Sure, but what I don't understand is why we need to quiesce more than
once (i.e. at the prctl() call). Quiescing should be a single operation
that prevents any further disturbance, like offloading anything we can
to other CPUs. Entering the kernel again shouldn't break that.

> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.

Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.


> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().

Well again I think it shouldn't happen. Quiescing should be done once
and for all.


> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?

Yeah, probably we'll need to introduce some sort of debuggability. Allow
only debug/trap exceptions, for example.


> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC workloads wanting this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.

I don't know. Perhaps HPC could just consist of quiescing once and for all
(offloading everything that can be offloaded) and not signalling when there
is a rare disturbance.

> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.

Yeah that's what I meant.


> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.

But if you do that, you need to do it properly and take care of races against
affinity changes. It involves heavy synchronization against scheduler code.

I tend to think we shouldn't bother with that. If we enter task isolation
mode on a non-nohz-full CPU, the task will be signalled and kicked out of task
isolation mode on the next tick, which happens very soon after the prctl().


> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)

Right.

> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.

Ok. And are you interested in such a strict mode? :-)

If so, it would be nice to start with just that and iterate on top of it.

> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)

Perfect!

> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...

I fear that _ALLOW_EXCEPTIONS is too wide for a special case if all we
want is to allow debugging.

The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far though.


> I'll spin it as a new patch series and you can take a look.

Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-07-05 14:41                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return back from the prctl. If we still get disturbed afterward,
> >either the quiescing is buggy or incomplete, or something is on the
> >way that can not be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got from the discussion at the bottom of this email.

Right.

> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.

Sure but what I don't understand is why do we need to quiesce more than
once (ie: at the prctl() call). Quiescing should be a single operation
that prevents from any further disturbance. Like offlining anything
we to other CPUs. And entering again in the kernel shouldn't break
that.

> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.

Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.


> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().

Well again I think it shouldn't happen. Quiescing should be done once
and for all.


> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?

Yeah, we'll probably need to introduce some sort of debuggability:
allowing debug/trap exceptions only, for example.
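
Something like the following flag layout, say (PR_TASK_ISOLATION_ENABLE
is the bit from this series; the two _ALLOW_* bits are purely
hypothetical sketches of that granularity):

#define PR_TASK_ISOLATION_ENABLE	(1 << 0)	/* from this series */
#define PR_TASK_ISOLATION_ALLOW_DEBUG	(1 << 1)	/* hypothetical */
#define PR_TASK_ISOLATION_ALLOW_SYSCALL	(1 << 2)	/* hypothetical */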


> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC wanting this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.

I don't know. Perhaps the HPC mode could just consist of quiescing once
and for all (offloading everything that can be) and not signalling when
there is a rare disturbance.

> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.

Yeah that's what I meant.


> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.

But if you do that, you need to do it properly and take care of races
against affinity changes. It involves heavy synchronization against
scheduler code.

I tend to think we shouldn't bother with that. If we enter task isolation
mode on a non-nohz_full CPU, the task will be signalled and kicked out of
task isolation mode by the next tick, which happens very soon after the
prctl().
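
If we keep just the warning suggested earlier, the prctl()-time check
could be as simple as the sketch below.  tick_nohz_full_cpu() is the
existing nohz predicate; the helper name and the message are
illustrative, not from the actual series:

#include <linux/sched.h>
#include <linux/tick.h>

static void task_isolation_warn_cpu(void)	/* illustrative name */
{
	int cpu = raw_smp_processor_id();

	if (!tick_nohz_full_cpu(cpu))
		pr_warn("%s/%d: task isolation requested on non-nohz_full CPU %d\n",
			current->comm, current->pid, cpu);
}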


> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)

Right.

> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.

Ok. And are you interested in such a strict mode? :-)

If so, it would be nice to start with just that and iterate on top of it.

> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)

Perfect!
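
For the record, the resulting userspace usage would be roughly the
sketch below.  The prctl() names and the value 48 are the ones this
series uses (they are not in mainline headers), and CPU 3 being
configured as nohz_full is an assumption:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48	/* value used by this series */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Pin to a single CPU first; assume CPU 3 is nohz_full. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0)) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}

	/* ... isolated work: no syscalls, no page faults ... */

	/* Turn isolation off before making a syscall, then back on. */
	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
	printf("made a syscall safely\n");
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

	/* ... more isolated work ... */

	/* Even exit() enters the kernel, so drop isolation first. */
	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
	return 0;
}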

> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...

I fear that _ALLOW_EXCEPTIONS is too broad for a special case if all we
want is to allow debugging.

The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far, though.


> I'll spin it as a new patch series and you can take a look.

Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.

Thanks.


* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-07-05 14:41                                 ` Frederic Weisbecker
@ 2016-07-05 17:47                                 ` Christoph Lameter
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Lameter @ 2016-07-05 17:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Tue, 5 Jul 2016, Frederic Weisbecker wrote:

> > >>That's true, but I'd argue the behavior in that case should be that you can
> > >>raise that kind of exception validly (so you can debug), and then you should
> > >>quiesce on return to userspace so the application doesn't see additional
> > >>exceptions.
> > >
> > >I don't see how we can quiesce such things.
> >
> > I'm imagining task A is in dataplane mode, and task B wants to debug
> > it by writing a breakpoint into its text.  When task A hits the
> > breakpoint, it will enter the kernel, and hold there while task B
> > pokes at it with ptrace.  When task A finally is allowed to return to
> > userspace, it should quiesce before entering userspace in case any
> > timer interrupts got scheduled (again, maybe due to softirqs or
> > whatever, or random other kernel activity targeting that core while it
> > was in the kernel, or whatever).  This is just the same kind of
> > quiescing we do on return from the initial prctl().
>
> Well again I think it shouldn't happen. Quiescing should be done once
> and for all.

For debugging, something like that would be helpful. And yes, for the
realtime use cases quiescing is once and for all (until we enter a
different operation mode, if requested by the app).


> > >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> > >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> > >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> > >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> > >>   "value add" to make this a separate mode, though.
> > >
> > >I can imagine HPC wanting this mode.
> >
> > Yes, perhaps.  I'm not convinced we want to target HPC without a much
> > clearer sense of why this is better than nohz_full, though.  I fear
> > people might think "task isolation" is better by definition and not
> > think too much about it, but I'm really not sure it is better for the
> > HPC use case, necessarily.

HPC folks generally like to actually understand what is going on in order
to get the best performance. Just expose the knobs for us, please.



Thread overview: 92+ messages
2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
2016-01-19 15:42   ` Frederic Weisbecker
2016-01-19 20:45     ` Chris Metcalf
2016-01-28  0:28       ` Frederic Weisbecker
2016-01-29 18:18         ` Chris Metcalf
2016-01-30 21:11           ` Frederic Weisbecker
2016-02-11 19:24             ` Chris Metcalf
2016-03-04 12:56               ` Frederic Weisbecker
2016-03-09 19:39                 ` Chris Metcalf
2016-04-08 13:56                   ` Frederic Weisbecker
2016-04-08 16:34                     ` Chris Metcalf
2016-04-12 18:41                       ` Chris Metcalf
2016-04-22 13:16                       ` Frederic Weisbecker
2016-04-25 20:36                         ` Chris Metcalf
2016-05-26  1:07                       ` Frederic Weisbecker
2016-06-03 19:32                         ` Chris Metcalf
2016-06-29 15:18                           ` Frederic Weisbecker
2016-07-01 20:59                             ` Chris Metcalf
2016-07-05 14:41                               ` Frederic Weisbecker
2016-07-05 17:47                                 ` Christoph Lameter
2016-01-04 19:34 ` [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 06/13] task_isolation: add debug boot flag Chris Metcalf
2016-01-04 22:52   ` Steven Rostedt
2016-01-04 23:42     ` Chris Metcalf
2016-01-05 13:42       ` Steven Rostedt
2016-01-04 19:34 ` [PATCH v9 07/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-01-04 21:02   ` [PATCH v9bis " Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 20:33   ` Mark Rutland
2016-01-04 21:01     ` Chris Metcalf
2016-01-05 17:21       ` Mark Rutland
2016-01-05 17:33         ` [PATCH 1/2] arm64: entry: remove pointless SPSR mode check Mark Rutland
2016-01-06 12:15           ` Catalin Marinas
2016-01-05 17:33         ` [PATCH 2/2] arm64: factor work_pending state machine to C Mark Rutland
2016-01-05 18:53           ` Chris Metcalf
2016-01-06 12:30           ` Catalin Marinas
2016-01-06 12:47             ` Mark Rutland
2016-01-06 13:43           ` Mark Rutland
2016-01-06 14:17             ` Catalin Marinas
2016-01-04 22:31     ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Andy Lutomirski
2016-01-05 18:01       ` Mark Rutland
2016-01-04 19:34 ` [PATCH v9 09/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 12/13] arch/tile: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-01-11 21:15 ` [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-12 10:07   ` Will Deacon
2016-01-12 17:49     ` Chris Metcalf
2016-01-13 10:44       ` Ingo Molnar
2016-01-13 21:19         ` Chris Metcalf
2016-01-20 13:27           ` Mark Rutland
2016-01-12 10:53   ` Ingo Molnar
