* [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-14 20:48 Chris Metcalf
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Here is a respin of the task-isolation patch set.  This primarily
reflects feedback from Frederic and Peter Z.

Changes since v12:

- Rebased on v4.7-rc7.

- New default "strict" mode for task isolation: the task exits the
  kernel to userspace from the initial prctl(), and the only legal
  way to re-enter the kernel is to call prctl() again to turn
  isolation off.  Any other kernel entry results in a SIGKILL by
  default.  (A usage sketch of the API follows this change list.)

- New optional "relaxed" mode, where the application can receive some
  signal other than SIGKILL, or no signal at all, when it re-enters
  the kernel.  Since by default task isolation is now strict, there is
  no longer an additional "STRICT" mode, but rather a new "NOSIG" mode
  that builds on top of the "USERSIG" support for setting a signal
  other than SIGKILL to be delivered to the process.  The "NOSIG" mode
  also relaxes the required criteria for entering task isolation mode;
  we just issue a warning if the affinity isn't set right, and we
  don't fail with EAGAIN if the kernel isn't ready to stop the tick.

  Running your task-isolation application in this "NOSIG" mode is also
  necessary when debugging, since otherwise hitting breakpoints, etc.,
  will cause a fatal signal to be sent to the process.

  Frederic has suggested we might want to defer this functionality
  until later, but (in addition to the debuggability aspect) there is
  some thought that it might be useful for e.g. HPC, so I have just
  broken out the additional semantics into a single separate patch at
  the end of the series.

- Function naming has been changed and comments have been added to try
  to clarify the role of the task-isolation reporting on kernel
  entries that do NOT cause signals.  This hopefully clarifies why we
  only invoke the renamed task_isolation_quiet_exception() in a few
  places, since all the other places generate signals anyway. [PeterZ]

- The task_isolation_debug() call now has an inline piece that checks
  whether the target is a task_isolation cpu before actually calling
  into the debug code. [PeterZ]

- In _task_isolation_debug(), we use the new task_struct_trylock()
  call that is in linux-next now; for now I just have a static copy of
  the function, which I will switch to using the version from
  linux-next in the next rebasing. [PeterZ]

- We now pass a string describing the interrupt up from
  task_isolation_debug() so there is more information on where the
  interrupt came from beyond just the stack backtrace. [PeterZ]

- I added task_isolation_debug() hooks to smp_sched_reschedule() on
  x86, which was missing before, and removed the hooks in the tile
  send_IPI_*() routines, since there were already hooks in the
  callers.  Likewise I moved the hook for arm64 from the generic
  smp_cross_call() routine to the only caller that wasn't already
  hooked, smp_send_reschedule().  The commit message clarifies the
  rationale for where hooks are placed.

- I moved the page fault reporting so that it only reports in the case
  that we are not also sending a SIGSEGV/SIGBUS, for consistency with
  other uses of task_isolation_quiet_exception().
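
For reference, here is a minimal userspace sketch (not part of the
series) of how an application pinned to a task_isolation cpu would
use the new API.  The prctl() values match include/uapi/linux/prctl.h
from this series; cpu affinity setup and error handling are elided:

	#include <stdio.h>
	#include <sys/prctl.h>

	#define PR_SET_TASK_ISOLATION		48
	#define PR_TASK_ISOLATION_ENABLE	(1 << 0)

	int main(void)
	{
		/* Must already be affinitized to one task_isolation cpu. */
		if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE)) {
			perror("prctl");	/* EINVAL or EAGAIN */
			return 1;
		}

		/* ... isolated work: no syscalls, page faults, etc. ... */

		/* Turn isolation off before using the kernel normally. */
		prctl(PR_SET_TASK_ISOLATION, 0);
		return 0;
	}

In the default strict mode, any kernel entry between the two prctl()
calls other than prctl/exit/exit_group is fatal to the task.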

The previous (v12) patch series is here:

https://lkml.kernel.org/g/1459877922-15512-1-git-send-email-cmetcalf@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal

 Documentation/kernel-parameters.txt    |  16 ++
 arch/arm64/Kconfig                     |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S              |  12 +-
 arch/arm64/kernel/ptrace.c             |  15 +-
 arch/arm64/kernel/signal.c             |  42 +++-
 arch/arm64/kernel/smp.c                |   2 +
 arch/arm64/mm/fault.c                  |   8 +-
 arch/tile/Kconfig                      |   1 +
 arch/tile/include/asm/thread_info.h    |   4 +-
 arch/tile/kernel/process.c             |   9 +
 arch/tile/kernel/ptrace.c              |   7 +
 arch/tile/kernel/single_step.c         |   7 +
 arch/tile/kernel/smp.c                 |  26 +--
 arch/tile/kernel/time.c                |   1 +
 arch/tile/kernel/unaligned.c           |   4 +
 arch/tile/mm/fault.c                   |  13 +-
 arch/tile/mm/homecache.c               |   2 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/entry/common.c                |  18 +-
 arch/x86/include/asm/thread_info.h     |   2 +
 arch/x86/kernel/smp.c                  |   2 +
 arch/x86/kernel/traps.c                |   3 +
 arch/x86/mm/fault.c                    |   5 +
 drivers/base/cpu.c                     |  18 ++
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h              |  73 +++++++
 include/linux/sched.h                  |   3 +
 include/linux/swap.h                   |   1 +
 include/linux/tick.h                   |   2 +
 include/linux/vmstat.h                 |   4 +
 include/uapi/linux/prctl.h             |  10 +
 init/Kconfig                           |  37 ++++
 kernel/Makefile                        |   1 +
 kernel/fork.c                          |   3 +
 kernel/irq_work.c                      |   5 +-
 kernel/isolation.c                     | 337 +++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    |  42 ++++
 kernel/signal.c                        |  15 ++
 kernel/smp.c                           |   6 +-
 kernel/softirq.c                       |  33 ++++
 kernel/sys.c                           |   9 +
 kernel/time/tick-sched.c               |  36 ++--
 mm/swap.c                              |  15 +-
 mm/vmstat.c                            |  19 ++
 46 files changed, 827 insertions(+), 56 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.7.2

* [PATCH v13 01/12] vmstat: add quiet_vmstat_sync function
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-kernel
  Cc: Chris Metcalf

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.
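
As a minimal sketch, the intended call site looks like this (the
real caller, task_isolation_enter(), is added later in this series):

	#include <linux/vmstat.h>

	/* Quiesce this cpu's vmstat work before returning to userspace. */
	static void example_quiesce_vmstat(void)
	{
		/* On return, no vmstat worker is scheduled on this core. */
		quiet_vmstat_sync();
	}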

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d2da8e053210..0d96b6b2079d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -188,6 +188,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -253,6 +254,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index cb2a67bb4158..945af7ab624e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1483,6 +1483,15 @@ void quiet_vmstat(void)
 }
 
 /*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2

* [PATCH v13 02/12] vmstat: add vmstat_idle function
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running and that
the vmstat diffs don't require an update.  It is called from the
task-isolation code to decide whether we need to actually do some
work to quiet vmstat.
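
As an illustrative sketch (the series itself uses vmstat_idle() in
the task_isolation_ready() test added in patch 04), the check pairs
with quiet_vmstat_sync() from the previous patch like this:

	#include <linux/vmstat.h>

	/* Only pay for the synchronous quiesce when there is work to do. */
	static void example_maybe_quiesce_vmstat(void)
	{
		if (!vmstat_idle())
			quiet_vmstat_sync();
	}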

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 0d96b6b2079d..1168c612a580 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -189,6 +189,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -255,6 +256,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 945af7ab624e..d742947df35d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1492,6 +1492,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2

* [PATCH v13 03/12] lru_add_drain_all: factor out lru_add_drain_needed
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was previously open-coded in the loop in
lru_add_drain_all(); factoring it out so it can be called for a
particular cpu is helpful for the task-isolation patches.
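
As an illustrative sketch of the per-cpu use this enables (the
series calls it from task_isolation_ready() in patch 04):

	#include <linux/swap.h>
	#include <linux/smp.h>

	/* Drain this cpu's pagevecs only if something is pending. */
	static void example_maybe_drain_lru(void)
	{
		int cpu = get_cpu();

		if (lru_add_drain_needed(cpu))
			lru_add_drain();
		put_cpu();
	}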

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0af2bb2028fd..40b2d76e9c03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -304,6 +304,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 90530ff8ed16..105cfc7ecc95 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -653,6 +653,15 @@ void deactivate_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -697,11 +706,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.7.2

* [PATCH v13 04/12] task_isolation: add initial support
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.
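
As a rough sketch, an architecture's exit-to-usermode path is then
expected to take this shape (the helpers marked "hypothetical" stand
in for the arch-specific TIF work loop; the real x86, arm64 and tile
wiring is in later patches of this series):

	/* Sketch of an arch prepare_exit_to_usermode() equivalent. */
	static void example_exit_to_usermode_loop(void)
	{
		for (;;) {
			unsigned long flags =
				READ_ONCE(current_thread_info()->flags);

			if (flags & _TIF_TASK_ISOLATION)
				task_isolation_enter();	/* irqs on */

			handle_other_tif_work(flags);	/* hypothetical */

			local_irq_disable();
			if (!pending_tif_work() &&	/* hypothetical */
			    (!test_thread_flag(TIF_TASK_ISOLATION) ||
			     task_isolation_ready()))
				return;	/* to userspace, irqs still off */
			local_irq_enable();
		}
	}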

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test; the syscall is aborted without running,
with its return value set to -ERESTARTNOINTR.

To allow the state to be entered and exited, the syscall-checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group so that the task can exit
without being pointlessly killed by a signal on the way out.
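
A strict-mode violation is also reported on the console via the
pr_warn() in task_isolation_deliver_signal() before the signal is
delivered, along the lines of (hypothetical task name and pid):

	isol_app/1234: task_isolation mode lost due to syscall 1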

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 217 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 82b42c958d1c..3db9bea08ed6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3892,6 +3892,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, set
+			the specified list of CPUs on which tasks will be
+			able to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f29ade..8195c14d021a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c02d89777713..fc71444f9c30 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -783,6 +783,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 userspace tasks running hard real-time workloads, such as
+	 a 10 Gbit network driver in userspace.
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a good-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a7ec0c6c88c..e1ab8f034a95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1535,6 +1536,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..bf3ebb0a727c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,217 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach here would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and prevents most syscalls from executing and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..4ff9bafd5af0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 536ada80f6dd..5cfde92a3785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 04/12] task_isolation: add initial support
@ 2016-07-14 20:48   ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it reqeusts rescheduling if the scheduler dyntick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 217 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 82b42c958d1c..3db9bea08ed6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3892,6 +3892,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, set
+			the specified list of CPUs where cpus will be able
+			to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+extern inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f29ade..8195c14d021a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1918,6 +1918,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c02d89777713..fc71444f9c30 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -783,6 +783,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 userspace tasks that are running hard real-time tasks in
+	 userspace, such as a 10 Gbit network driver in userspace.
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a7ec0c6c88c..e1ab8f034a95 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1535,6 +1536,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..bf3ebb0a727c
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,217 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get preempted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and prevents most syscalls from executing and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
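
For reference, the architecture exit-to-usermode loops added later in
this series consume the two hooks above in roughly this shape (a
sketch, not literal code from any one architecture):

	while (cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) {
		local_irq_enable();

		/* ... handle reschedule, signals, notify-resume, etc. ... */

		if (cached_flags & _TIF_TASK_ISOLATION)
			task_isolation_enter();	/* quiesce, IRQs on */

		local_irq_disable();
		cached_flags = READ_ONCE(current_thread_info()->flags);

		/* The final readiness check runs with IRQs disabled. */
		if ((cached_flags & _TIF_TASK_ISOLATION) &&
		    task_isolation_ready())
			cached_flags &= ~_TIF_TASK_ISOLATION;
	}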
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..4ff9bafd5af0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
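
Since PR_GET_TASK_ISOLATION reports the flags word directly as the
prctl() return value, a userspace check is simply (reusing the assumed
PR_* definitions from the sketch above):

	int flags = prctl(PR_GET_TASK_ISOLATION, 0, 0, 0, 0);

	if (flags > 0 && (flags & PR_TASK_ISOLATION_ENABLE))
		/* still running isolated */;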
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 536ada80f6dd..5cfde92a3785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 05/12] task_isolation: track asynchronous interrupts
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (3 preceding siblings ...)
  2016-07-14 20:48   ` Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 06/12] arch/x86: enable task isolation functionality Chris Metcalf
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump a stack trace to the console.

We also support a new "task_isolation_debug" boot flag which forces
the stack trace to be dumped out regardless.  We try to catch the
original source of the interrupt, e.g. if an IPI is dispatched to a
task-isolation task, we dump the backtrace of the remote core that
is sending the IPI, rather than just dumping a trace showing that
the core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.
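
Such a layer might look roughly like this (hypothetical sketch;
arch_smp_send_reschedule() does not exist today):

	/* generic code */
	void smp_send_reschedule(int cpu)
	{
		task_isolation_debug(cpu, "reschedule IPI");
		arch_smp_send_reschedule(cpu);
	}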

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt    |  8 ++++
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h              | 13 ++++++
 kernel/irq_work.c                      |  5 ++-
 kernel/isolation.c                     | 74 ++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    | 42 +++++++++++++++++++
 kernel/signal.c                        |  7 ++++
 kernel/smp.c                           |  6 ++-
 kernel/softirq.c                       | 33 +++++++++++++++
 9 files changed, 192 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3db9bea08ed6..15fe7f029f8b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3900,6 +3900,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
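
As an example, a command line enabling both pieces might read (the
cpu list is purely illustrative):

	task_isolation=2-3 task_isolation_debug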
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+	return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
 			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
 	} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)					\
+	do {								\
+		if (task_isolation_possible(cpu))			\
+			_task_isolation_debug(cpu, type);		\
+	} while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+					 const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+				      const char *type);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -55,6 +66,8 @@ extern inline void task_isolation_set_flags(struct task_struct *p,
 					    unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu, const char *type) { }
+#define task_isolation_debug_cpumask(mask, type) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..15f3d44acf11 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index bf3ebb0a727c..a9fd4709825a 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include <asm/syscall.h>
 #include "time/tick-sched.h"
@@ -215,3 +216,76 @@ int task_isolation_syscall(int syscall)
 					 -ERESTARTNOINTR, -1);
 	return -1;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * Ensure the task is actually in userspace; if it is in kernel
+	 * mode, it is expected that it may receive interrupts, and in
+	 * any case they don't affect the isolation.  Note that there
+	 * is a race condition here as a task may have committed
+	 * to returning to user space but not yet set the context
+	 * tracking state to reflect it, and the check here is before
+	 * we trigger the interrupt, so we might fail to warn about a
+	 * legitimate interrupt.  However, the race window is narrow
+	 * and hitting it does not cause any incorrect behavior other
+	 * than failing to send the warning.
+	 */
+	if (!context_tracking_cpu_in_user(cpu))
+		return;
+
+	/*
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (in_nmi())
+		force_debug = true;
+	else
+		task_isolation_deliver_signal(p, type);
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("cpu %d: %s violating task isolation for %s/%d on cpu %d\n",
+		       smp_processor_id(), type, p->comm, p->pid, cpu);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask, const char *type)
+{
+	int cpu, thiscpu = get_cpu();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu_and(cpu, mask, task_isolation_map)
+		if (cpu != thiscpu)
+			_task_isolation_debug(cpu, type);
+
+	put_cpu();
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105f529a..7d230e70e195 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
 #include <linux/frame.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -663,6 +664,47 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+/*
+ * NOTE: this function is currently in linux-next and included here
+ * as a place-holder for merging upstream.
+ */
+static struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct task_struct *task;
+	struct sighand_struct *sighand;
+
+	rcu_read_lock();
+retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		goto done;
+	probe_kernel_address(&task->sighand, sighand);
+	smp_rmb();
+	if (unlikely(task != READ_ONCE(*ptask)))
+		goto retry;
+	if (!sighand) {
+		task = NULL;
+		goto done;
+	}
+	get_task_struct(task);
+done:
+	rcu_read_unlock();
+	return task;
+}
+
+void _task_isolation_debug(int cpu, const char *type)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task = try_get_task_struct(&rq->curr);
+
+	if (task) {
+		task_isolation_debug_task(cpu, task, type);
+		put_task_struct(task);
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 4ff9bafd5af0..5c98905df96f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -639,6 +639,13 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/*
+	 * We're delivering a signal anyway, so no need for more
+	 * warnings.  This also avoids self-deadlock since an IPI to
+	 * kick the task would otherwise generate another signal.
+	 */
+	task_isolation_set_flags(t, 0);
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 74165443c240..58e8129a87e9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -177,8 +178,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -456,6 +459,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..2f1065795318 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id(), "irq");
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 06/12] arch/x86: enable task isolation functionality
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (4 preceding siblings ...)
  2016-07-14 20:48 ` [PATCH v13 05/12] task_isolation: track asynchronous interrupts Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48   ` Chris Metcalf
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In exit_to_usermode_loop(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
and after we've handled the other work, call task_isolation_enter()
for such tasks.

In syscall_trace_enter_phase1(), we add the necessary support for
reporting syscalls for task-isolation processes.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 18 +++++++++++++++++-
 arch/x86/include/asm/thread_info.h |  2 ++
 arch/x86/kernel/smp.c              |  2 ++
 arch/x86/kernel/traps.c            |  3 +++
 arch/x86/mm/fault.c                |  5 +++++
 6 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9a94da0c29f..0762072ba284 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -89,6 +89,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY		if X86_64
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e538c44..33fc40b29c9f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -87,6 +88,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 
 	work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp first -- it should minimize exposure of other
@@ -202,7 +210,7 @@ long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -236,11 +244,19 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((cached_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			cached_flags &= ~_TIF_TASK_ISOLATION;
+
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133ac05cd..10167e086f3b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -97,6 +97,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_TASK_ISOLATION	13	/* task isolation enabled for task */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -121,6 +122,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777cf3851..e4ffd9581cdb 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/isolation.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -125,6 +126,7 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
+	task_isolation_debug(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d82e69a..4989af93bb33 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -383,6 +384,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		/* No signal was generated, but notify task-isolation tasks. */
+		task_isolation_quiet_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7d1fa7cd2374..655b4ae0c9b8 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_quiet_exception */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1397,6 +1398,10 @@ good_area:
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & PF_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 07/12] arm64: factor work_pending state machine to C
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
@ 2016-07-14 20:48   ` Chris Metcalf
  2016-07-14 20:48   ` Chris Metcalf
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.

This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will almost
certainly enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/kernel/entry.S  | 12 ++++--------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6c3b7345a6c4..2bbf7753c674 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -689,18 +689,13 @@ ret_fast_syscall_trace:
  * Ok, we need to do extra processing, enter the slow path.
  */
 work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
 	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bl	trace_hardirqs_off		// the IRQs are off here, inform the tracing code
+	bl	trace_hardirqs_on		// enabled while in userspace
 #endif
-	bl	schedule
-
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
+	b	finish_ret_to_user
 /*
  * "slow" syscall return path.
  */
@@ -709,6 +704,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+finish_ret_to_user:
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8eafdbc7cb8..404dd67080b9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,31 @@ static void do_signal(struct pt_regs *regs)
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+	/*
+	 * The assembly code enters us with IRQs off, but it hasn't
+	 * informed the tracing code of that for efficiency reasons.
+	 * Update the trace code with the current status.
+	 */
+	trace_hardirqs_off();
+	do {
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+		} else {
+			local_irq_enable();
+
+			if (thread_flags & _TIF_SIGPENDING)
+				do_signal(regs);
+
+			if (thread_flags & _TIF_NOTIFY_RESUME) {
+				clear_thread_flag(TIF_NOTIFY_RESUME);
+				tracehook_notify_resume(regs);
+			}
+
+			if (thread_flags & _TIF_FOREIGN_FPSTATE)
+				fpsimd_restore_current_state();
+		}
 
+		local_irq_disable();
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+	} while (thread_flags & _TIF_WORK_MASK);
 }
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 08/12] arch/arm64: enable task isolation functionality
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
@ 2016-07-14 20:48   ` Chris Metcalf
  2016-07-14 20:48   ` Chris Metcalf
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

In do_notify_resume(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags;
and after we've handled the other work, call task_isolation_enter()
for such tasks.  To ensure we always call task_isolation_enter() when
returning to userspace, add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
while leaving the old bitmask value as _TIF_WORK_LOOP_MASK to
check while looping.

We tweak syscall_trace_enter() slightly to read the "flags" value
from current_thread_info()->flags once and use that copy for each
of the tests, rather than doing a volatile read from memory for
each one.  This avoids a small overhead per test, and in particular
avoids that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 15 ++++++++++++---
 arch/arm64/kernel/signal.c           | 10 ++++++++++
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  8 +++++++-
 6 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5a0a691d4220..36ada13f503f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -58,6 +58,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARM_SMCCC
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..bdc6426b9968 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -109,6 +109,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
+#define TIF_TASK_ISOLATION	4
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9
@@ -124,6 +125,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -132,7 +134,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
+				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 3f6cd5c5234f..ae336065733d 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1246,14 +1247,22 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	/* Do the secure computing check first; failures should be fast. */
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 404dd67080b9..f9b9b25636ca 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -424,9 +425,18 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
 				fpsimd_restore_current_state();
+
+			if (thread_flags & _TIF_TASK_ISOLATION)
+				task_isolation_enter();
 		}
 
 		local_irq_disable();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
+
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_flags & _TIF_WORK_MASK);
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 62ff3c0622e2..30559aebde5c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -866,6 +867,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_debug(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b1166d1e5955..9c2ba2a318b0 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -354,8 +355,13 @@ retry:
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
-			      VM_FAULT_BADACCESS))))
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_quiet_exception("page fault at %#lx",
+						       addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we have no context to
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 09/12] arch/tile: enable task isolation functionality
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (7 preceding siblings ...)
  2016-07-14 20:48   ` Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 10/12] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_quiet_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/Kconfig                   |  1 +
 arch/tile/include/asm/thread_info.h |  4 +++-
 arch/tile/kernel/process.c          |  9 +++++++++
 arch/tile/kernel/ptrace.c           |  7 +++++++
 arch/tile/kernel/single_step.c      |  7 +++++++
 arch/tile/kernel/smp.c              | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c        |  4 ++++
 arch/tile/mm/fault.c                | 13 ++++++++++++-
 arch/tile/mm/homecache.c            |  2 ++
 9 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4820a02838ac..937cfe4cbb5b 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -18,6 +18,7 @@ config TILE
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index c1467ac59ce6..470c8aa5cb09 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_NOHZ		11	/* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION	12	/* in task isolation mode */
 
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
@@ -139,11 +140,12 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION	(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
 	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_TASK_ISOLATION)
 
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index a465d8372edd..bbe1d29b242f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -496,9 +497,17 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		if (thread_info_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_info_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_info_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_info_flags & _TIF_WORK_MASK);
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..475a362415fb 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -255,6 +256,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
+	/* In isolation mode, we may prevent the syscall from running. */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+			return -1;
+	}
+
 	if (secure_computing() == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..b48da9860b80 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,9 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +771,9 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..d610322026d0 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(task_isolation_map, "icache flush");
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu, "reschedule IPI");
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 9772a3554282..0335f7cd81f4 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,9 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..096ac03cab5a 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -308,8 +309,13 @@ static int handle_page_fault(struct pt_regs *regs,
 	 */
 	pgd = get_current_pgd();
 	if (handle_migrating_pte(pgd, fault_num, address, regs->pc,
-				 is_kernel_mode, write))
+				 is_kernel_mode, write)) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (!is_kernel_mode)
+			task_isolation_quiet_exception("migration fault at %#lx",
+						       address);
 		return 1;
+	}
 
 	si_code = SEGV_MAPERR;
 
@@ -479,6 +485,11 @@ good_area:
 #endif
 
 	up_read(&mm->mmap_sem);
+
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	return 1;
 
 /*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..316a27906956 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask, "homecache flush");
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 10/12] arm, tile: turn off timer tick for oneshot_stopped state
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (8 preceding siblings ...)
  2016-07-14 20:48 ` [PATCH v13 09/12] arch/tile: " Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 11/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the scheduler tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 4814446a0024..45bbf6000867 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -306,6 +306,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 11/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (9 preceding siblings ...)
  2016-07-14 20:48 ` [PATCH v13 10/12] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 20:48 ` [PATCH v13 12/12] task_isolation: add user-settable notification signal Chris Metcalf
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.
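
For illustration (assuming the usual cpu-list syntax of the
"task_isolation=" boot argument), enabling CONFIG_TASK_ISOLATION_ALL=y
on a 16-core machine is roughly equivalent to booting with:

    task_isolation=1-15

with the boot CPU (typically CPU 0) left out of the range.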

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 init/Kconfig       | 10 ++++++++++
 kernel/isolation.c |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index fc71444f9c30..0b8384c76571 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -810,6 +810,16 @@ config TASK_ISOLATION
 	 You should say "N" unless you are intending to run a
 	 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+	bool "Provide task isolation on all CPUs by default (except CPU 0)"
+	depends on TASK_ISOLATION
+	help
+	 If the user doesn't pass the task_isolation boot option to
+	 define the range of task isolation CPUs, treat all CPUs in
+	 the system as task isolation CPUs by default.
+	 Note the boot CPU will still be kept outside the range to
+	 handle timekeeping duty, etc.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index a9fd4709825a..5e6cd67dfb0c 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
 	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
 	if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+		alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		cpumask_copy(task_isolation_map, cpu_possible_mask);
+		cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
 		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
 		return 0;
+#endif
 	}
 
 	/*
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v13 12/12] task_isolation: add user-settable notification signal
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (10 preceding siblings ...)
  2016-07-14 20:48 ` [PATCH v13 11/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
@ 2016-07-14 20:48 ` Chris Metcalf
  2016-07-14 21:03   ` Andy Lutomirski
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 20:48 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return from
the signal handler to continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.
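
For illustration, a minimal userspace sketch (hypothetical code; only
the PR_* values come from this patch, repeated here in case the uapi
header is not yet updated) might request SIGUSR1 on a violation:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SET_TASK_ISOLATION
    #define PR_SET_TASK_ISOLATION          48
    #define PR_TASK_ISOLATION_ENABLE       (1 << 0)
    #define PR_TASK_ISOLATION_USERSIG      (1 << 1)
    #define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
    #endif

    /* Runs when isolation is lost; write() is async-signal-safe. */
    static void isolation_lost(int sig)
    {
            static const char msg[] = "task isolation lost\n";
            (void)sig;
            write(STDERR_FILENO, msg, sizeof(msg) - 1);
            /* Could log more here, or re-enable isolation via prctl(). */
    }

    int main(void)
    {
            signal(SIGUSR1, isolation_lost);

            /* Ask for SIGUSR1 instead of SIGKILL on a violation. */
            if (prctl(PR_SET_TASK_ISOLATION,
                      PR_TASK_ISOLATION_ENABLE |
                      PR_TASK_ISOLATION_USERSIG |
                      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0) != 0) {
                    perror("prctl");
                    return 1;
            }

            /* ... run the time-sensitive loop on the isolated core ... */
            return 0;
    }

Passing PR_TASK_ISOLATION_NOSIG instead of the USERSIG/SET_SIG pair
selects the relaxed no-signal behavior described above.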

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/uapi/linux/prctl.h |  5 ++++
 kernel/isolation.c         | 62 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 5e6cd67dfb0c..aca5de5e2e05 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
 	return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+	if (flags & PR_TASK_ISOLATION_USERSIG)
+		return PR_TASK_ISOLATION_GET_SIG(flags);
+	else
+		return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
 	if (flags != 0) {
+		int sig = task_isolation_sig(flags);
+
 		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
 		    !task_isolation_possible(raw_smp_processor_id())) {
 			/* Invalid task affinity setting. */
-			return -EINVAL;
+			if (sig)
+				return -EINVAL;
+			else
+				pr_warn("%s/%d: enabling non-signalling task isolation\n"
+					"and not bound to a single task isolation core\n",
+					current->comm, current->pid);
 		}
-		if (!can_stop_my_full_tick_now()) {
+
+		if (sig && !can_stop_my_full_tick_now()) {
 			/* System not yet ready for task isolation. */
 			return -EAGAIN;
 		}
@@ -160,11 +183,11 @@ void task_isolation_enter(void)
 }
 
 static void task_isolation_deliver_signal(struct task_struct *task,
-					  const char *buf)
+					  const char *buf, int sig)
 {
 	siginfo_t info = {};
 
-	info.si_signo = SIGKILL;
+	info.si_signo = sig;
 
 	/*
 	 * Report on the fact that isolation was violated for the task.
@@ -175,7 +198,10 @@ static void task_isolation_deliver_signal(struct task_struct *task,
 	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
 		task->comm, task->pid, buf);
 
-	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	/*
+	 * Turn off task isolation mode to avoid further isolation callbacks.
+	 * It can choose to re-enable task isolation mode in the signal handler.
+	 */
 	task_isolation_set_flags(task, 0);
 
 	send_sig_info(info.si_signo, &info, task);
@@ -190,15 +216,20 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 	struct task_struct *task = current;
 	va_list args;
 	char buf[100];
+	int sig;
 
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
 
+	sig = task_isolation_sig(task->task_isolation_flags);
+	if (sig == 0)
+		return;
+
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 
-	task_isolation_deliver_signal(task, buf);
+	task_isolation_deliver_signal(task, buf, sig);
 }
 
 /*
@@ -209,14 +240,19 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 int task_isolation_syscall(int syscall)
 {
 	char buf[20];
+	int sig;
 
 	if (syscall == __NR_prctl ||
 	    syscall == __NR_exit ||
 	    syscall == __NR_exit_group)
 		return 0;
 
+	sig = task_isolation_sig(current->task_isolation_flags);
+	if (sig == 0)
+		return 0;
+
 	snprintf(buf, sizeof(buf), "syscall %d", syscall);
-	task_isolation_deliver_signal(current, buf);
+	task_isolation_deliver_signal(current, buf, sig);
 
 	syscall_set_return_value(current, current_pt_regs(),
 					 -ERESTARTNOINTR, -1);
@@ -236,6 +272,7 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 {
 	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
 	bool force_debug = false;
+	int sig;
 
 	/*
 	 * Our caller made sure the task was running on a task isolation
@@ -266,10 +303,13 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 	 * and instead just treat it as if "debug" mode was enabled,
 	 * since that's pretty much all we can do.
 	 */
-	if (in_nmi())
-		force_debug = true;
-	else
-		task_isolation_deliver_signal(p, type);
+	sig = task_isolation_sig(p->task_isolation_flags);
+	if (sig != 0) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_deliver_signal(p, type, sig);
+	}
 
 	/*
 	 * If (for example) the timer interrupt starts ticking
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-14 21:03   ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2016-07-14 21:03 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc, Linux API, linux-kernel

On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> Here is a respin of the task-isolation patch set.  This primarily
> reflects feedback from Frederic and Peter Z.

I still think this is the wrong approach, at least at this point.  The
first step should be to instrument things if necessary and fix the
obvious cases where the kernel gets entered asynchronously.  Only once
there's a credible reason to believe it can work well should any form
of strictness be applied.

As an example, enough vmalloc/vfree activity will eventually cause
flush_tlb_kernel_range to be called and *boom*, there goes your shiny
production dataplane application.  Once virtually mapped kernel stacks
happen, the frequency with which this happens will only increase.

On very brief inspection, __kmem_cache_shutdown will be a problem on
some workloads as well.

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-14 21:22     ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-14 21:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc, Linux API, linux-kernel

On 7/14/2016 5:03 PM, Andy Lutomirski wrote:
> On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> Here is a respin of the task-isolation patch set.  This primarily
>> reflects feedback from Frederic and Peter Z.
> I still think this is the wrong approach, at least at this point.  The
> first step should be to instrument things if necessary and fix the
> obvious cases where the kernel gets entered asynchronously.

Note, however, that the task_isolation_debug mode is a very convenient
way of discovering what is going on when things do go wrong for task isolation.

> Only once
> there's a credible reason to believe it can work well should any form
> of strictness be applied.

I'm not sure what criteria you need for this, though.  Certainly we've been
shipping our version of task isolation to customers since 2008, and there
are quite a few customer applications in production that are working well.
I'd argue that's a credible reason.

> As an example, enough vmalloc/vfree activity will eventually cause
> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
> production dataplane application.

Well, that's actually a refinement that I did not inflict on this patch series.

In our code base, we have a hook for kernel TLB flushes that defers such
flushes for cores that are running in userspace, because, after all, they
don't yet care about such flushes.  Instead, we atomically set a flag that
is checked on entry to the kernel, and that causes the TLB flush to occur
at that point.
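
Roughly, the idea looks like this (just a sketch, not the actual Tilera
code; the function names are made up, and it assumes the architecture
provides a full local TLB flush such as arm64's local_flush_tlb_all()):

    #include <linux/percpu.h>
    #include <linux/bitops.h>
    #include <asm/tlbflush.h>

    /* Bit 0 set means "a kernel TLB flush is pending for this cpu". */
    static DEFINE_PER_CPU(unsigned long, kernel_flush_pending);

    /* Sender side: instead of IPI'ing an isolated cpu in userspace... */
    static void defer_kernel_tlb_flush(int cpu)
    {
            set_bit(0, &per_cpu(kernel_flush_pending, cpu));
    }

    /* Entry side: called very early on every kernel entry path. */
    void task_isolation_check_pending_flush(void)
    {
            if (test_and_clear_bit(0, this_cpu_ptr(&kernel_flush_pending)))
                    local_flush_tlb_all();  /* coarse, but safe */
    }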

> On very brief inspection, __kmem_cache_shutdown will be a problem on
> some workloads as well.

That looks like it should be amenable to a version of the same fix I pushed
upstream in 5fbc461636c32efd ("mm: make lru_add_drain_all() selective").
You would basically check which cores have non-empty caches, and only
interrupt those cores.  For extra credit, you empty the cache on your local cpu
when you are entering task isolation mode.  Now you don't get interrupted.
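
Something along these lines, say (a sketch only; cpu_cache_nonempty()
and drain_local_cache() are hypothetical helpers, not existing slab
functions):

    #include <linux/cpumask.h>
    #include <linux/slab.h>
    #include <linux/smp.h>

    /* Hypothetical: true if @cpu holds cached objects for @s. */
    static bool cpu_cache_nonempty(struct kmem_cache *s, int cpu)
    {
            return true;  /* placeholder */
    }

    /* Hypothetical: free this cpu's cached objects for the cache. */
    static void drain_local_cache(void *info)
    {
    }

    static void drain_caches_selective(struct kmem_cache *s)
    {
            cpumask_var_t mask;
            int cpu;

            if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                    return;

            for_each_online_cpu(cpu)
                    if (cpu_cache_nonempty(s, cpu))
                            cpumask_set_cpu(cpu, mask);

            /* Only interrupt the cpus that actually have work to do. */
            on_each_cpu_mask(mask, drain_local_cache, s, 1);
            free_cpumask_var(mask);
    }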

To be fair, I've never seen this particular path cause an interruption.  And I
think this speaks to the fact that there really can't be a black and white
decision about when you have removed enough possible interrupt paths.
It really does depend on what else is running on your machine in addition
to the task isolation code, and that will vary from application to application.
And, as the kernel evolves, new ways of interrupting task isolation cores
will get added and need to be dealt with.  There really isn't a perfect time
you can wait for and then declare that all the asynchronous entry cases
have been dealt with and now things are safe for task isolation.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-18  0:42     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-18  0:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc, Linux API, linux-kernel

On Thu, 14 Jul 2016, Andy Lutomirski wrote:

> As an example, enough vmalloc/vfree activity will eventually cause
> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
> production dataplane application.  Once virtually mapped kernel stacks
> happen, the frequency with which this happens will only increase.

But then vmalloc/vfree activity is not to be expected if only user space is
running. Since the kernel is not active and this affects kernel address
space only, it could be deferred. Such events will cause OS activity that
causes a number of high-latency events, but then the system will quiet down
again.

> On very brief inspection, __kmem_cache_shutdown will be a problem on
> some workloads as well.

These are all corner cases that can be worked on over time if they are
significant. The main issue here is to reduce the obvious and relatively
frequent causes for ticks and allow easier detection of events that cause
tick activity.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-14 21:22     ` Chris Metcalf
  (?)
@ 2016-07-18 22:11     ` Andy Lutomirski
  2016-07-18 22:50       ` Chris Metcalf
  -1 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2016-07-18 22:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc, Linux API, linux-kernel

On Thu, Jul 14, 2016 at 2:22 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 7/14/2016 5:03 PM, Andy Lutomirski wrote:
>>
>> On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf <cmetcalf@mellanox.com>
>> wrote:
>>>
>>> Here is a respin of the task-isolation patch set.  This primarily
>>> reflects feedback from Frederic and Peter Z.
>>
>> I still think this is the wrong approach, at least at this point.  The
>> first step should be to instrument things if necessary and fix the
>> obvious cases where the kernel gets entered asynchronously.
>
>
> Note, however, that the task_isolation_debug mode is a very convenient
> way of discovering what is going on when things do go wrong for task
> isolation.
>
>> Only once
>> there's a credible reason to believe it can work well should any form
>> of strictness be applied.
>
>
> I'm not sure what criteria you need for this, though.  Certainly we've been
> shipping our version of task isolation to customers since 2008, and there
> are quite a few customer applications in production that are working well.
> I'd argue that's a credible reason.
>
>> As an example, enough vmalloc/vfree activity will eventually cause
>> flush_tlb_kernel_range to be called and *boom*, there goes your shiny
>> production dataplane application.
>
>
> Well, that's actually a refinement that I did not inflict on this patch
> series.

Submit it separately, perhaps?

The "kill the process if it goofs" think while there are known goofs
in the kernel, apparently with patches written but unsent, seems
questionable.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-18 22:11     ` Andy Lutomirski
@ 2016-07-18 22:50       ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-18 22:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	linux-doc, Linux API, linux-kernel

On 7/18/2016 6:11 PM, Andy Lutomirski wrote:
>>> As an example, enough vmalloc/vfree activity will eventually cause
>>> flush_tlb_kernel_range to be called and*boom*, there goes your shiny
>>> production dataplane application.
>>
>> Well, that's actually a refinement that I did not inflict on this patch
>> series.
> Submit it separately, perhaps?
>
> The "kill the process if it goofs" thing while there are known goofs
> in the kernel, apparently with patches written but unsent, seems
> questionable.

Sure, that's a good idea.

I think what I will plan to do is, once the patch series is accepted into
some tree, return to this piece.  I'll have to go back and look at the internal
Tilera version of this code, since we have diverged quite a ways from that
in the 13 versions of the patch series, but my memory is that the kernel TLB
flush management was the only substantial piece of additional code not in
the initial batch of changes.  The extra requirement is a hook very early
in the kernel entry path that can be placed on every entry path;
arm64 has the ct_user_exit macro and tile has the finish_interrupt_save macro,
but I'm not sure there's something equivalent on x86 to catch all entries.

It's worth noting, though, that the typical deployment for task isolation
(at least in our experience) is a pretty dedicated machine, with the primary
application running in task isolation mode almost all of the time, and so
you are generally in pretty good control of all aspects of the system,
including whether or not your non-task-isolation cores are generating
kernel TLB flushes.  So I would argue the kernel TLB flush management piece
is an improvement to, not a requirement for, the main patch series.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-21  2:04   ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-21  2:04 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

We are trying to test the patchset on x86 and are getting strange
backtraces and aborts. It seems that the cpu numerically below the one we
are running on queues an irq_work event that causes a latency event on our
cpu.

This is weird. Is there a new round robin IPI feature in the kernel that I
am not aware of?

Backtraces from dmesg:

[  956.603223] latencytest/7928: task_isolation mode lost due to irq_work
[  956.610817] cpu 12: irq_work violating task isolation for latencytest/7928 on cpu 13
[  956.619985] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[  956.628765] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[  956.637642]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
[  956.646739]  ffff88102c50d700 000000000000000d ffff88103e783d28 ffffffff811986f4
[  956.655828]  ffff88102c50d700 ffff88203cf97f80 000000000000000d ffff88103e783d68
[  956.664924] Call Trace:
[  956.667945]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
[  956.674740]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
[  956.682229]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
[  956.689331]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
[  956.696142]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
[  956.703438]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
[  956.710150]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
[  956.716959]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
[  956.723283]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
[  956.730480]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.738555]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
[  956.744973]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.753046]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
[  956.759952]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
[  956.766272]  [<ffffffff81091245>] irq_exit+0xf5/0x100
[  956.772209]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
[  956.779600]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
[  956.786602]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
[  956.793490]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
[  956.800498]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
[  956.806810]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
[  956.813717]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0
[ 1036.601758] cpu 12: irq_work violating task isolation for latencytest/8447 on cpu 13
[ 1036.610922] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[ 1036.619692] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[ 1036.628551]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
[ 1036.637648]  ffff88102dca0000 000000000000000d ffff88103e783d28 ffffffff811986f4
[ 1036.646741]  ffff88102dca0000 ffff88203cf97f80 000000000000000d ffff88103e783d68
[ 1036.655833] Call Trace:
[ 1036.658852]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
[ 1036.665649]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
[ 1036.673136]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
[ 1036.680237]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
[ 1036.687091]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
[ 1036.694388]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
[ 1036.701089]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
[ 1036.707896]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
[ 1036.714210]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
[ 1036.721411]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.729478]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
[ 1036.735899]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.743970]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
[ 1036.750878]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
[ 1036.757199]  [<ffffffff81091245>] irq_exit+0xf5/0x100
[ 1036.763132]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
[ 1036.770520]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
[ 1036.777520]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
[ 1036.784410]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
[ 1036.791413]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
[ 1036.797734]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
[ 1036.804641]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
@ 2016-07-21 14:06     ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-21 14:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/20/2016 10:04 PM, Christoph Lameter wrote:
> We are trying to test the patchset on x86 and are getting strange
> backtraces and aborts. It seems that the cpu numerically below the one we
> are running on queues an irq_work event that causes a latency event on our
> cpu.
>
> This is weird. Is there a new round robin IPI feature in the kernel that I
> am not aware of?

This seems to be from your clocksource declaring itself to be
unstable, and then scheduling work to safely remove that timer.
I haven't looked at this code before (in kernel/time/clocksource.c
under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
arm64 and tile aren't unstable.  Is it possible to boot your machine
with a stable clocksource?
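
(For example, on x86 booting with "tsc=reliable" marks the TSC stable
and disables the clocksource watchdog entirely; just a suggestion,
though, since I haven't tried it on your hardware.)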


> Backtraces from dmesg:
>
> [  956.603223] latencytest/7928: task_isolation mode lost due to irq_work
> [  956.610817] cpu 12: irq_work violating task isolation for latencytest/7928 on cpu 13
> [  956.619985] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
> [  956.628765] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
> [  956.637642]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
> [  956.646739]  ffff88102c50d700 000000000000000d ffff88103e783d28 ffffffff811986f4
> [  956.655828]  ffff88102c50d700 ffff88203cf97f80 000000000000000d ffff88103e783d68
> [  956.664924] Call Trace:
> [  956.667945]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
> [  956.674740]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
> [  956.682229]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
> [  956.689331]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
> [  956.696142]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
> [  956.703438]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
> [  956.710150]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
> [  956.716959]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
> [  956.723283]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
> [  956.730480]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
> [  956.738555]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
> [  956.744973]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
> [  956.753046]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
> [  956.759952]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
> [  956.766272]  [<ffffffff81091245>] irq_exit+0xf5/0x100
> [  956.772209]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
> [  956.779600]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
> [  956.786602]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
> [  956.793490]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
> [  956.800498]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
> [  956.806810]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
> [  956.813717]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0
> [ 1036.601758] cpu 12: irq_work violating task isolation for latencytest/8447 on cpu 13
> [ 1036.610922] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
> [ 1036.619692] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
> [ 1036.628551]  0000000000000086 ce6735c7b39e7b81 ffff88103e783d00 ffffffff8134f6ff
> [ 1036.637648]  ffff88102dca0000 000000000000000d ffff88103e783d28 ffffffff811986f4
> [ 1036.646741]  ffff88102dca0000 ffff88203cf97f80 000000000000000d ffff88103e783d68
> [ 1036.655833] Call Trace:
> [ 1036.658852]  <IRQ>  [<ffffffff8134f6ff>] dump_stack+0x63/0x84
> [ 1036.665649]  [<ffffffff811986f4>] task_isolation_debug_task+0xb4/0xd0
> [ 1036.673136]  [<ffffffff810b4a13>] _task_isolation_debug+0x83/0xc0
> [ 1036.680237]  [<ffffffff81179c0c>] irq_work_queue_on+0x9c/0x120
> [ 1036.687091]  [<ffffffff811075e4>] tick_nohz_full_kick_cpu+0x44/0x50
> [ 1036.694388]  [<ffffffff810b48d9>] wake_up_nohz_cpu+0x99/0x110
> [ 1036.701089]  [<ffffffff810f57e1>] internal_add_timer+0x71/0xb0
> [ 1036.707896]  [<ffffffff810f696b>] add_timer_on+0xbb/0x140
> [ 1036.714210]  [<ffffffff81100ca0>] clocksource_watchdog+0x230/0x300
> [ 1036.721411]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
> [ 1036.729478]  [<ffffffff810f5615>] call_timer_fn+0x35/0x120
> [ 1036.735899]  [<ffffffff81100a70>] ? __clocksource_unstable.isra.2+0x40/0x40
> [ 1036.743970]  [<ffffffff810f64cc>] run_timer_softirq+0x23c/0x2f0
> [ 1036.750878]  [<ffffffff816d4397>] __do_softirq+0xd7/0x2c5
> [ 1036.757199]  [<ffffffff81091245>] irq_exit+0xf5/0x100
> [ 1036.763132]  [<ffffffff816d41d2>] smp_apic_timer_interrupt+0x42/0x50
> [ 1036.770520]  [<ffffffff816d231c>] apic_timer_interrupt+0x8c/0xa0
> [ 1036.777520]  <EOI>  [<ffffffff81569eb0>] ? poll_idle+0x40/0x80
> [ 1036.784410]  [<ffffffff815697dc>] cpuidle_enter_state+0x9c/0x260
> [ 1036.791413]  [<ffffffff815699d7>] cpuidle_enter+0x17/0x20
> [ 1036.797734]  [<ffffffff810cf497>] cpu_startup_entry+0x2b7/0x3a0
> [ 1036.804641]  [<ffffffff81050e6c>] start_secondary+0x15c/0x1a0
>
>

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-21 14:06     ` Chris Metcalf
  (?)
@ 2016-07-22  2:20     ` Christoph Lameter
  2016-07-22 12:50         ` Chris Metcalf
  -1 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-07-22  2:20 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel


On Thu, 21 Jul 2016, Chris Metcalf wrote:
> On 7/20/2016 10:04 PM, Christoph Lameter wrote:

> unstable, and then scheduling work to safely remove that timer.
> I haven't looked at this code before (in kernel/time/clocksource.c
> under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
> arm64 and tile aren't unstable.  Is it possible to boot your machine
> with a stable clocksource?

It already has a stable clocksource. Sorry, but that was one of the criteria
for the server when we ordered them. Could this be clock adjustments?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22  2:20     ` Christoph Lameter
@ 2016-07-22 12:50         ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-22 12:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/21/2016 10:20 PM, Christoph Lameter wrote:
> On Thu, 21 Jul 2016, Chris Metcalf wrote:
>> On 7/20/2016 10:04 PM, Christoph Lameter wrote:
>> unstable, and then scheduling work to safely remove that timer.
>> I haven't looked at this code before (in kernel/time/clocksource.c
>> under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
>> arm64 and tile aren't unstable.  Is it possible to boot your machine
>> with a stable clocksource?
> It already has a stable clocksource. Sorry, but that was one of the criteria
> for the server when we ordered them. Could this be clock adjustments?

We probably need to get clock folks to jump in on this thread!

Maybe it's disabling some built-in unstable clock just as part of
falling back to using the better, stable clock that you also have?
So maybe there's a way of just disabling that clocksource from the
get-go instead of having it be marked unstable later.

If you run the test again after this storm of unstable marking, does
it all happen again?  Or is it a persistent state in the kernel?
If so, maybe you can just arrange to get to that state before starting
your application's task-isolation code.

Or, if you think it's clock adjustments, perhaps running your test with
ntpd disabled would make it work better?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22 12:50         ` Chris Metcalf
  (?)
@ 2016-07-25 16:35         ` Christoph Lameter
  2016-07-27 13:55             ` Christoph Lameter
  -1 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-07-25 16:35 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Fri, 22 Jul 2016, Chris Metcalf wrote:

> > It already has a stable clocksource. Sorry, but that was one of the criteria
> > for the server when we ordered them. Could this be clock adjustments?
>
> We probably need to get clock folks to jump in on this thread!

Guess so. I will have a look at this when I get some time again.

> Maybe it's disabling some built-in unstable clock just as part of
> falling back to using the better, stable clock that you also have?
> So maybe there's a way of just disabling that clocksource from the
> get-go instead of having it be marked unstable later.

This is a standard Dell server. No clocksources are marked as unstable as
far as I can tell.

> If you run the test again after this storm of unstable marking, does
> it all happen again?  Or is it a persistent state in the kernel?

This happens anytime we try to run with prctl().

I hope to get some more detail once I get some time to look at this. But
this is likely an x86 specific problem.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-07-27 13:55             ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 13:55 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Mon, 25 Jul 2016, Christoph Lameter wrote:

> Guess so. I will have a look at this when I get some time again.

Ok so the problem is the clocksource_watchdog() function in
kernel/time/clocksource.c. This function is active if
CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the clocksource on
each processor for being within bounds and then reschedule itself on the
next one.

The purpose of the function seems to be to determine *if* a clocksource is
unstable. The fact that it runs does not mean that the clocksource *is* unstable.

The critical piece of code is this:

        /*
         * Cycle through CPUs to check if the CPUs stay synchronized
         * to each other.
         */
        next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
        if (next_cpu >= nr_cpu_ids)
                next_cpu = cpumask_first(cpu_online_mask);
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);


Should we just cycle through the cpus that are not isolated? Otherwise we
need to have some means to check the clocksources for accuracy remotely
(probably impossible for TSC etc).

The WATCHDOG_INTERVAL is 1 second, so this causes an interrupt every
second.

Note that we are running with the patch that removes the 1 HZ minimum time
tick. With an older kernel code base (Red Hat) we can keep the kernel quiet
for minutes. The clocksource watchdog causes timers to fire again.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
                   ` (13 preceding siblings ...)
  2016-07-21  2:04   ` Christoph Lameter
@ 2016-07-27 14:01 ` Christoph Lameter
  14 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 14:01 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel


We tested this with 4.7-rc7 and aside from the issue with
clocksource_watchdog() this is working fine.

Tested-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 13:55             ` Christoph Lameter
@ 2016-07-27 14:12               ` Chris Metcalf
  -1 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-27 14:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 9:55 AM, Christoph Lameter wrote:
> The critical piece of code is this:
>
>          /*
>           * Cycle through CPUs to check if the CPUs stay synchronized
>           * to each other.
>           */
>          next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>          if (next_cpu >= nr_cpu_ids)
>                  next_cpu = cpumask_first(cpu_online_mask);
>          watchdog_timer.expires += WATCHDOG_INTERVAL;
>          add_timer_on(&watchdog_timer, next_cpu);
>
>
> Should we just cycle through the cpus that are not isolated? Otherwise we
> need to have some means to check the clocksources for accuracy remotely
> (probably impossible for TSC etc).

That sounds like the right idea - use the housekeeping cpu mask instead of the
cpu online mask.  Should be a straightforward patch; do you want to do that
and test it in your configuration, and I'll include it in the next spin of the
patch series?

Thanks for your testing!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-07-27 15:23                 ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 15:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> > Should we just cycle through the cpus that are not isolated? Otherwise we
> > need to have some means to check the clocksources for accuracy remotely
> > (probably impossible for TSC etc).
>
> That sounds like the right idea - use the housekeeping cpu mask instead of the
> cpu online mask.  Should be a straightforward patch; do you want to do that
> and test it in your configuration, and I'll include it in the next spin of the
> patch series?

Sadly housekeeping_mask is defined the following way:

static inline const struct cpumask *housekeeping_cpumask(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_enabled())
                return housekeeping_mask;
#endif
        return cpu_possible_mask;
}

Why is it not returning cpu_online_mask?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-07-27 15:31                   ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 15:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

Ok here is a possible patch that explicitly checks for housekeeping cpus:

Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/kernel/time/clocksource.c
===================================================================
--- linux.orig/kernel/time/clocksource.c	2016-07-27 08:41:17.109862517 -0500
+++ linux/kernel/time/clocksource.c	2016-07-27 10:28:31.172447732 -0500
@@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigne
 	 * Cycle through CPUs to check if the CPUs stay synchronized
 	 * to each other.
 	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
-	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
+	do {
+		next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+		if (next_cpu >= nr_cpu_ids)
+			next_cpu = cpumask_first(cpu_online_mask);
+	} while (!is_housekeeping_cpu(next_cpu));
+
 	watchdog_timer.expires += WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, next_cpu);
 out:

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 15:31                   ` Christoph Lameter
@ 2016-07-27 17:06                     ` Chris Metcalf
  -1 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-27 17:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 11:31 AM, Christoph Lameter wrote:
> Ok here is a possible patch that explicitly checks for housekeeping cpus:
>
> Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus
>
> watchdog checks can only run on housekeeping capable cpus. Otherwise
> we will be generating noise that we would like to avoid on the isolated
> processors.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/kernel/time/clocksource.c
> ===================================================================
> --- linux.orig/kernel/time/clocksource.c	2016-07-27 08:41:17.109862517 -0500
> +++ linux/kernel/time/clocksource.c	2016-07-27 10:28:31.172447732 -0500
> @@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigne
>   	 * Cycle through CPUs to check if the CPUs stay synchronized
>   	 * to each other.
>   	 */
> -	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> -	if (next_cpu >= nr_cpu_ids)
> -		next_cpu = cpumask_first(cpu_online_mask);
> +	do {
> +		next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> +		if (next_cpu >= nr_cpu_ids)
> +			next_cpu = cpumask_first(cpu_online_mask);
> +	} while (!is_housekeeping_cpu(next_cpu));
> +
>   	watchdog_timer.expires += WATCHDOG_INTERVAL;
>   	add_timer_on(&watchdog_timer, next_cpu);
>   out:

How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Note that you should also use cpumask_first_and() in clocksource_start_watchdog(),
just to be complete.

Hopefully the init code runs after tick_init().  It seems like that's probably true.
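
For clocksource_start_watchdog() I'm thinking of something like this
(a sketch against the 4.7 code, untested):

--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ static inline void clocksource_start_watchdog(void)
-	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+	add_timer_on(&watchdog_timer,
+		     cpumask_first_and(cpu_online_mask, housekeeping_cpumask()));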

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 17:06                     ` Chris Metcalf
  (?)
@ 2016-07-27 18:56                     ` Christoph Lameter
  2016-07-27 19:49                         ` Chris Metcalf
  -1 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 18:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
> housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Ok here is V2:


Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus V2

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/kernel/time/clocksource.c
===================================================================
--- linux.orig/kernel/time/clocksource.c
+++ linux/kernel/time/clocksource.c
@@ -269,9 +269,10 @@ static void clocksource_watchdog(unsigne
 	 * Cycle through CPUs to check if the CPUs stay synchronized
 	 * to each other.
 	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+	next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask, housekeeping_cpumask());
 	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
+		next_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask());
+
 	watchdog_timer.expires += WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, next_cpu);
 out:

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 18:56                     ` Christoph Lameter
@ 2016-07-27 19:49                         ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-27 19:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 2:56 PM, Christoph Lameter wrote:
> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>
>> How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
>> housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?
> Ok here is V2:
>
>
> Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus V2
>
> watchdog checks can only run on housekeeping capable cpus. Otherwise
> we will be generating noise that we would like to avoid on the isolated
> processors.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/kernel/time/clocksource.c
> ===================================================================
> --- linux.orig/kernel/time/clocksource.c
> +++ linux/kernel/time/clocksource.c
> @@ -269,9 +269,10 @@ static void clocksource_watchdog(unsigne
>   	 * Cycle through CPUs to check if the CPUs stay synchronized
>   	 * to each other.
>   	 */
> -	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
> +	next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask, housekeeping_cpumask());
>   	if (next_cpu >= nr_cpu_ids)
> -		next_cpu = cpumask_first(cpu_online_mask);
> +		next_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask());
> +
>   	watchdog_timer.expires += WATCHDOG_INTERVAL;
>   	add_timer_on(&watchdog_timer, next_cpu);
>   out:

Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
on purpose?  For now I just took your change, but tweaked it to add the
equivalent diff with cpumask_first_and() there.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 19:49                         ` Chris Metcalf
  (?)
@ 2016-07-27 19:53                         ` Christoph Lameter
  2016-07-27 19:58                             ` Chris Metcalf
  -1 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-07-27 19:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, 27 Jul 2016, Chris Metcalf wrote:

> Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
> on purpose?  For now I just took your change, but tweaked it to add the
> equivalent diff with cpumask_first_and() there.

Can the watchdog be started on an isolated cpu at all? I would expect that
the code would start a watchdog only on a housekeeping cpu.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 19:53                         ` Christoph Lameter
@ 2016-07-27 19:58                             ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-27 19:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On 7/27/2016 3:53 PM, Christoph Lameter wrote:
> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>
>> Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
>> on purpose?  For now I just took your change, but tweaked it to add the
>> equivalent diff with cpumask_first_and() there.
> Can the watchdog be started on an isolated cpu at all? I would expect that
> the code would start a watchdog only on a housekeeping cpu.

The code just starts the watchdog initially on the first online cpu.
In principle you could have configured that as an isolated cpu, so
without any change to that code, you'd interrupt that cpu.

I guess another way to slice it would be to start the watchdog on the
current core.  But just using the same idiom as in clocksource_watchdog()
seems cleanest to me.

I added your patch to the series and pushed it up (along with adding your
Tested-by to the x86 enablement commit).  It's still based on 4.6 so I'll need
to rebase it once the merge window closes.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-07-29 18:31                               ` Francis Giraldeau
  0 siblings, 0 replies; 72+ messages in thread
From: Francis Giraldeau @ 2016-07-29 18:31 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

I tested this patch on 4.7 and confirm that irq_work does not occur anymore on
the isolated cpu. Thanks!

I don't know of any utility to test the task isolation feature, so I started
one:

    https://github.com/giraldeau/taskisol

The script exp.sh runs taskisol to test five different conditions, but some of
the behavior is not what I would expect.

At startup, it does:
 - register a custom signal handler for SIGUSR1
 - sched_setaffinity() on CPU 1, which is isolated
 - mlockall(MCL_CURRENT) to prevent undesired page faults
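
In code, those steps are roughly (a sketch; the handler name is mine):

    signal(SIGUSR1, usr1_handler);      /* custom SIGUSR1 handler */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                   /* CPU 1 is the isolated cpu */
    sched_setaffinity(0, sizeof(cpu_set_t), &set);

    mlockall(MCL_CURRENT);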

The default strict mode is set with:

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE)

Then the syscall write() is called. From the previous discussion, a SIGKILL
should be sent, but it does not occur. When we force a page fault instead of
calling write(), the SIGKILL is correctly sent.

When instead a custom signal handler for SIGUSR1 is set:

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
                      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

The signal is never delivered, neither when the syscall is issued nor when the
page fault occurs.

I can confirm that, if two taskisol processes are created on the same CPU, the
second one fails with Resource temporarily unavailable, so that's fine.

I can add more test cases depending on your comments, such as the TLB events
triggered by another thread on a non-isolated core. But maybe there is already
a test suite?

Francis

2016-07-27 15:58 GMT-04:00 Chris Metcalf <cmetcalf@mellanox.com>:
> On 7/27/2016 3:53 PM, Christoph Lameter wrote:
>>
>> On Wed, 27 Jul 2016, Chris Metcalf wrote:
>>
>>> Looks good.  Did you omit the equivalent fix in
>>> clocksource_start_watchdog()
>>> on purpose?  For now I just took your change, but tweaked it to add the
>>> equivalent diff with cpumask_first_and() there.
>>
>> Can the watchdog be started on an isolated cpu at all? I would expect that
>> the code would start a watchdog only on a housekeeping cpu.
>
>
> The code just starts the watchdog initially on the first online cpu.
> In principle you could have configured that as an isolated cpu, so
> without any change to that code, you'd interrupt that cpu.
>
> I guess another way to slice it would be to start the watchdog on the
> current core.  But just using the same idiom as in clocksource_watchdog()
> seems cleanest to me.
>
> I added your patch to the series and pushed it up (along with adding your
> Tested-by to the x86 enablement commit).  It's still based on 4.6 so I'll
> need
> to rebase it once the merge window closes.
>
>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-07-29 21:04                                 ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-07-29 21:04 UTC (permalink / raw)
  To: Francis Giraldeau
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On 7/29/2016 2:31 PM, Francis Giraldeau wrote:
> I tested this patch on 4.7 and confirm that irq_work does not occurs anymore on
> the isolated cpu. Thanks!

Great!  Let me know if you'd like me to add your Tested-by in the patch series.

> I don't know of any utility to test the task isolation feature, so I started
> one:
>
>      https://github.com/giraldeau/taskisol
>
> The script exp.sh runs the taskisol to test five different conditions, but some
> behavior is not the one I would expect.
>
> At startup, it does:
>   - register a custom signal handler for SIGUSR1
>   - sched_setaffinity() on CPU 1, which is isolated
>   - mlockall(MCL_CURRENT) to prevent undesired page faults
>
> The default strict mode is set with:
>
>      prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE)
>
> And then, the syscall write() is called. From previous discussion, the SIGKILL
> should be sent, but it does not occur. When instead of calling write() we force
> a page fault, then the SIGKILL is correctly sent.

This looks like it may be a bug in the x86-specific part of the kernel support.
On tilegx and arm64, running your test does the right thing:

# ./taskisol default syscall
taskisol run
taskisol/1855: task_isolation mode lost due to syscall 64
Killed

I think the x86 support doesn't properly return right away from a bad
syscall.  The patch below should fix that; can you try it?  However, it's
not clear to me why the signal isn't getting delivered.  Perhaps you can
try adding some tracing to the syscall_trace_enter() path and see if we're
actually running this code as expected?  Thank you!  :-)
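
Something as simple as this, right before the _TIF_TASK_ISOLATION check
(purely illustrative), would show whether we even get there:

    pr_info("task_isolation: syscall entry, orig_ax %ld, work %#lx\n",
            (long)regs->orig_ax, work);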

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -90,8 +90,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
  
      /* In isolation mode, we may prevent the syscall from running. */
      if (work & _TIF_TASK_ISOLATION) {
-        if (task_isolation_syscall(regs->orig_ax) == -1)
-            return -1;
+        if (task_isolation_syscall(regs->orig_ax) == -1) {
+            regs->orig_ax = -1;
+            return 0;
+        }
          work &= ~_TIF_TASK_ISOLATION;
      }

I updated my dataplane branch on kernel.org with this fix.

> When instead a custom signal handler SIGUSR1:
>
>      prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
>                        PR_TASK_ISOLATION_SET_SIG(SIGUSR1)
>
> The signal is never delivered, either when the syscall is issued nor when the
> page fault occurs.

This is a bug in your test program.  Try again with this fix:

--- a/taskisol.c
+++ b/taskisol.c
@@ -79,8 +79,9 @@ int main(int argc, char *argv[])
           * The program completes when using USERSIG,
           * but actually no signal is delivered
           */
-        if (strcmp(argv[1], "signal") == 0) {
-            if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_USERSIG |
+        else if (strcmp(argv[1], "signal") == 0) {
+            if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
+                      PR_TASK_ISOLATION_USERSIG |
                        PR_TASK_ISOLATION_SET_SIG(SIGUSR1)) < 0) {
                  perror("prctl sigusr");
                  return -1;

The prctl() API is intended to be one-shot, i.e. you set all the state you
want with a single prctl().  The next call to prctl() will reset the state
to whatever you specify (including if you don't specify "enable").
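
So, for example, enabling isolation with a custom signal and later turning
it off again looks like this (sketch):

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
          PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
    /* ... isolated work ... */
    prctl(PR_SET_TASK_ISOLATION, 0);   /* no flags: isolation off */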

(Also, as a side note, I'd expect your Makefile to invoke $(CC) for taskisol,
not $(CXX) - there doesn't seem to be any actual C++ in the program.)

> I can confirm that, if two taskisol are created on the same CPU, the second one
> fails with Resource temporarily unavailable, so that's fine.
>
> I can add more test cases depending on your comments, such as the TLB events
> triggered by another thread on a non-isolated core. But maybe there is already
> a test suite?

The appended code is what I've been using as a test harness.  It passes on
tilegx and arm64.  No guarantees as to production-level code quality :-)

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION   // Not in system headers yet?
# define PR_SET_TASK_ISOLATION		48
# define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
     (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
#endif

// The cpu we are using for isolation tests.
static int task_isolation_cpu;

// Overall status, maintained as tests run.
static int exit_status = EXIT_SUCCESS;

// Set affinity to a single cpu.
int set_my_cpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(cpu_set_t), &set);
}

// Run a child process in task isolation mode and report its status.
// The child does mlockall() and moves itself to the task isolation cpu.
// It then runs SETUP_FUNC (if specified), calls prctl(PR_SET_TASK_ISOLATION)
// with FLAGS (if non-zero), and then invokes TEST_FUNC and exits
// with its status.
static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
{
	fflush(stdout);
	int pid = fork();
	assert(pid >= 0);
	if (pid != 0) {
		// In parent; wait for child and return its status.
		int status;
		waitpid(pid, &status, 0);
		return status;
	}

	// In child.
	int rc = mlockall(MCL_CURRENT);
	assert(rc == 0);
	rc = set_my_cpu(task_isolation_cpu);
	assert(rc == 0);
	if (setup_func)
		setup_func();
	if (flags) {
		int rc;
		do
			rc = prctl(PR_SET_TASK_ISOLATION, flags);
		while (rc != 0 && errno == EAGAIN);
		if (rc != 0) {
			printf("couldn't enable isolation (%d): FAIL\n", errno);
			exit(EXIT_FAILURE);
		}
	}
	rc = test_func();
	exit(rc);
}

// Run a test and ensure it is killed with SIGKILL by default,
// for whatever misdemeanor is committed in TEST_FUNC.
// Also test it with SIGUSR1 as well to make sure that works.
static void test_killed(const char *testname, void (*setup_func)(),
			int (*test_func)())
{
	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}

	status = run_test(setup_func, test_func,
			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
			  PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
		printf("%s (SIGUSR1): OK\n", testname);
	} else {
		printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Run a test and make sure it exits with success.
static void test_ok(const char *testname, void (*setup_func)(),
		    int (*test_func)())
{
	int status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
	if (status == EXIT_SUCCESS) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Run a test with no signals and make sure it exits with success.
static void test_nosig(const char *testname, void (*setup_func)(),
		       int (*test_func)())
{
	int status =
		run_test(setup_func, test_func,
			 PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
	if (status == EXIT_SUCCESS) {
		printf("%s: OK\n", testname);
	} else {
		printf("%s: FAIL (%#x)\n", testname, status);
		exit_status = EXIT_FAILURE;
	}
}

// Mapping address passed from setup function to test function.
static char *fault_file_mapping;

// mmap() a file in so we can test touching an unmapped page.
static void setup_fault(void)
{
	char fault_file[] = "/tmp/isolation_XXXXXX";
	int fd = mkstemp(fault_file);
	assert(fd >= 0);
	int rc = ftruncate(fd, getpagesize());
	assert(rc == 0);
	fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
	assert(fault_file_mapping != MAP_FAILED);
	close(fd);
	unlink(fault_file);
}

// Now touch the unmapped page (and be killed).
static int do_fault(void)
{
	*fault_file_mapping = 1;
	return EXIT_FAILURE;
}

// Make a syscall (and be killed).
static int do_syscall(void)
{
	write(STDOUT_FILENO, "goodbye, world\n", 15);
	return EXIT_FAILURE;
}

// Turn isolation back off and don't be killed.
static int do_syscall_off(void)
{
	prctl(PR_SET_TASK_ISOLATION, 0);
	write(STDOUT_FILENO, "==> hello, world\n", 17);
	return EXIT_SUCCESS;
}

// If we're not getting a signal, make sure we can do multiple system calls.
static int do_syscall_multi(void)
{
	write(STDOUT_FILENO, "==> hello, world 1\n", 19);
	write(STDOUT_FILENO, "==> hello, world 2\n", 19);
	return EXIT_SUCCESS;
}

#ifdef __aarch64__
/* ARM64 uses tlbi instructions so doesn't need to interrupt the remote core. */
static void test_munmap(void) {}
#else

// Fork a thread that will munmap() after a short while.
// It will deliver a TLB flush to the task isolation core.

static void *start_munmap(void *p)
{
	usleep(500000);   // 0.5s
	munmap(p, getpagesize());
	return 0;
}

static void setup_munmap(void)
{
	// First, go back to cpu 0 and allocate some memory.
	set_my_cpu(0);
	void *p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, -1, 0);
	assert(p != MAP_FAILED);

	// Now fire up a thread that will wait half a second on cpu 0
	// and then munmap the mapping.
	pthread_t thr;
	int rc = pthread_create(&thr, NULL, start_munmap, p);
	assert(rc == 0);

	// Back to the task-isolation cpu.
	set_my_cpu(task_isolation_cpu);
}

// Global variable to avoid the compiler outsmarting us.
volatile int munmap_spin;

static int do_munmap(void)
{
	while (munmap_spin < 1000000000)
		++munmap_spin;
	return EXIT_FAILURE;
}

static void test_munmap(void)
{
	test_killed("test_munmap", setup_munmap, do_munmap);
}
#endif

#ifdef __tilegx__
// Make an unaligned access (and be killed).
// Only for tilegx, since other platforms don't do in-kernel fixups.
static int
do_unaligned(void)
{
	static int buf[2];
	volatile int *addr = (volatile int *)((char *)buf + 1);

	*addr;	// volatile read performs the unaligned access

	asm("nop");
	return EXIT_FAILURE;
}

static void test_unaligned(void)
{
	test_killed("test_unaligned", NULL, do_unaligned);
}
#else
static void test_unaligned(void) {}
#endif

// Fork a process that will spin annoyingly on the same core
// for half a second.  Since prctl() won't succeed while another
// task is runnable on the core, we follow this handshake sequence:
//
// 1. Child (in setup_quiesce, here) starts up, sets its childstate
//    word to let the parent know it's running, and busy-waits for
//    the shared state word to change.
// 2. Parent (in do_quiesce, below) enters isolation mode (retrying
//    prctl() until it succeeds), then sets the state to 1 to release
//    the child.  From this point, as soon as the parent is scheduled
//    out, it won't be scheduled back in until the child stops spinning.
// 3. Child sees the state change to 1, migrates to the isolated cpu,
//    sets the state to 2, and spins for half a second before exiting.
// 4. Parent spins waiting for the state to reach 2, then makes one
//    syscall.  The syscall should take about half a second to return,
//    since the spinning child holds the cpu until it exits.

volatile int *statep, *childstate;
struct timeval quiesce_start, quiesce_end;
int child_pid;

static void setup_quiesce(void)
{
	// First, go back to cpu 0 and allocate some shared memory.
	set_my_cpu(0);
	statep = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
		      MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	assert(statep != MAP_FAILED);
	childstate = statep + 1;

	gettimeofday(&quiesce_start, NULL);

	// Fork, then fault in and lock all memory in both processes.
	child_pid = fork();
	assert(child_pid >= 0);
	if (child_pid == 0)
		*childstate = 1;
	int rc = mlockall(MCL_CURRENT);
	assert(rc == 0);
	if (child_pid != 0) {
		set_my_cpu(task_isolation_cpu);
		return;
	}

	// In child.  Wait until parent notifies us that it has completed
	// its prctl, then jump to its cpu and let it know.
	*childstate = 2;
	while (*statep == 0)
		;
	*childstate = 3;
	//  printf("child: jumping to cpu %d\n", task_isolation_cpu);
	set_my_cpu(task_isolation_cpu);
	//  printf("child: jumped to cpu %d\n", task_isolation_cpu);
	*statep = 2;
	*childstate = 4;

	// Now we are competing for the runqueue on task_isolation_cpu.
	// Spin for half a second to ensure the parent gets caught in kernel space.
	struct timeval start, tv;
	gettimeofday(&start, NULL);
	while (1) {
		gettimeofday(&tv, NULL);
		double time = (tv.tv_sec - start.tv_sec) +
			(tv.tv_usec - start.tv_usec) / 1000000.0;
		if (time >= 0.5)
			exit(0);
	}
}

static int do_quiesce(void)
{
	double time;
	int rc;

	rc = prctl(PR_SET_TASK_ISOLATION,
		   PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
	if (rc != 0) {
		prctl(PR_SET_TASK_ISOLATION, 0);
		printf("prctl failed: rc %d", rc);
		goto fail;
	}
	*statep = 1;

	// Wait for child to come disturb us.
	while (*statep == 1) {
		gettimeofday(&quiesce_end, NULL);
		time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
			(quiesce_end.tv_usec - quiesce_start.tv_usec)/1000000.0;
		if (time > 0.1 && *statep == 1)	{
			prctl(PR_SET_TASK_ISOLATION, 0);
			printf("timed out at %gs in child migrate loop (%d)\n",
			       time, *childstate);
			char buf[100];
			sprintf(buf, "cat /proc/%d/stack", child_pid);
			system(buf);
			goto fail;
		}
	}
	assert(*statep == 2);

	// At this point the child is spinning, so any interrupt will keep us
	// in kernel space.  Make a syscall to make sure it happens at least
	// once during the half second that the child is spinning.
	kill(0, 0);
	gettimeofday(&quiesce_end, NULL);
	prctl(PR_SET_TASK_ISOLATION, 0);
	time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
		(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
	if (time < 0.4 || time > 0.6) {
		printf("expected 1s wait after quiesce: was %g\n", time);
		goto fail;
	}
	kill(child_pid, SIGKILL);
	return EXIT_SUCCESS;

fail:
	kill(child_pid, SIGKILL);
	return EXIT_FAILURE;
}

int main(int argc, char **argv)
{
	/* How many seconds to wait after running the other tests? */
	double waittime;
	if (argc == 1)
		waittime = 10;
	else if (argc == 2)
		waittime = strtof(argv[1], NULL);
	else {
		printf("syntax: isolation [seconds]\n");
		exit(EXIT_FAILURE);
	}

	/* Test that the /sys device is present and pick a cpu. */
	FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");
	if (f == NULL) {
		printf("/sys device: FAIL\n");
		exit(EXIT_FAILURE);
	}
	char buf[100];
	char *result = fgets(buf, sizeof(buf), f);
	assert(result == buf);
	fclose(f);
	char *end;
	task_isolation_cpu = strtol(buf, &end, 10);
	assert(end != buf);
	assert(*end == ',' || *end == '-' || *end == '\n');
	assert(task_isolation_cpu >= 0);
	printf("/sys device : OK\n");

	// Test to see if with no mask set, we fail.
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
	    errno != EINVAL) {
		printf("prctl unaffinitized: FAIL\n");
		exit_status = EXIT_FAILURE;
	} else {
		printf("prctl unaffinitized: OK\n");
	}

	// Or if affinitized to the wrong cpu.
	set_my_cpu(0);
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
	    errno != EINVAL) {
		printf("prctl on cpu 0: FAIL\n");
		exit_status = EXIT_FAILURE;
	} else {
		printf("prctl on cpu 0: OK\n");
	}

	// Run the tests.
	test_killed("test_fault", setup_fault, do_fault);
	test_killed("test_syscall", NULL, do_syscall);
	test_munmap();
	test_unaligned();
	test_ok("test_off", NULL, do_syscall_off);
	test_nosig("test_multi", NULL, do_syscall_multi);
	test_nosig("test_quiesce", setup_quiesce, do_quiesce);

	// Exit failure if any test failed.
	if (exit_status != EXIT_SUCCESS)
		return exit_status;

	// Wait for however long was requested on the command line.
	// Note that this requires a vDSO implementation of gettimeofday();
	// if it's not available, we could just spin a fixed number of
	// iterations instead.
	struct timeval start, tv;
	gettimeofday(&start, NULL);
	while (1) {
		gettimeofday(&tv, NULL);
		double time = (tv.tv_sec - start.tv_sec) +
			(tv.tv_usec - start.tv_usec) / 1000000.0;
		if (time >= waittime)
			break;
	}

	return EXIT_SUCCESS;
}
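
(To build the harness, something like "gcc -O2 -pthread -o isolation
isolation.c" should suffice; it needs nothing beyond pthreads, enough
privilege for mlockall(), and a kernel with these patches so that the
/sys file read in main() exists.)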

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-07-27 13:55             ` Christoph Lameter
  (?)
  (?)
@ 2016-08-10 22:16             ` Frederic Weisbecker
  2016-08-10 22:26                 ` Chris Metcalf
  2016-08-11  8:40               ` Peter Zijlstra
  -1 siblings, 2 replies; 72+ messages in thread
From: Frederic Weisbecker @ 2016-08-10 22:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Wed, Jul 27, 2016 at 08:55:28AM -0500, Christoph Lameter wrote:
> On Mon, 25 Jul 2016, Christoph Lameter wrote:
> 
> > Guess so. I will have a look at this when I get some time again.
> 
> Ok so the problem is the clocksource_watchdog() function in
> kernel/time/clocksource.c. This function is active if
> CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the time sources of
> each processor for being within bounds and then reschedule itself on the
> next one.
> 
> The purpose of the function seems to be to determine *if* a clocksource is
> unstable. It does not mean that the clocksource *is* unstable.
> 
> The critical piece of code is this:
> 
>         /*
>          * Cycle through CPUs to check if the CPUs stay synchronized
>          * to each other.
>          */
>         next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>         if (next_cpu >= nr_cpu_ids)
>                 next_cpu = cpumask_first(cpu_online_mask);
>         watchdog_timer.expires += WATCHDOG_INTERVAL;
>         add_timer_on(&watchdog_timer, next_cpu);
> 
> 
> Should we just cycle through the cpus that are not isolated? Otherwise we
> need to have some means to check the clocksources for accuracy remotely
> (probably impossible for TSC etc).
> 
> The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
> second.
> 
> Note that we are running with the patch that removes the 1 HZ minimum time
> tick. With an older kernel code base (redhat) we can keep the kernel quiet
> for minutes. The clocksource watchdog causes timers to fire again.

I had similar issues, this seems to happen when the tsc is considered not reliable
(which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
flag).

IIRC, this _has_ to execute on all online CPUs because the TSCs of all running
CPUs are concerned.

I personally override that by passing the tsc=reliable kernel parameter. Of course
use it at your own risk.

But eventually I don't think we can offload that to housekeeping-only CPUs.
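
For reference, a minimal sketch of that "cycle through only the
non-isolated cpus" idea might look like the following (illustrative
only, not a tested patch; it assumes tick_nohz_full_cpu() as the test
and would need a fallback for the case where every online cpu is
nohz_full):

	/* Hypothetical variant of the add_timer_on() logic quoted above. */
	next_cpu = raw_smp_processor_id();
	do {
		next_cpu = cpumask_next(next_cpu, cpu_online_mask);
		if (next_cpu >= nr_cpu_ids)
			next_cpu = cpumask_first(cpu_online_mask);
	} while (tick_nohz_full_cpu(next_cpu));
	watchdog_timer.expires += WATCHDOG_INTERVAL;
	add_timer_on(&watchdog_timer, next_cpu);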

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-10 22:16             ` Frederic Weisbecker
@ 2016-08-10 22:26                 ` Chris Metcalf
  2016-08-11  8:40               ` Peter Zijlstra
  1 sibling, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-08-10 22:26 UTC (permalink / raw)
  To: Frederic Weisbecker, Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 8/10/2016 6:16 PM, Frederic Weisbecker wrote:
> On Wed, Jul 27, 2016 at 08:55:28AM -0500, Christoph Lameter wrote:
>> On Mon, 25 Jul 2016, Christoph Lameter wrote:
>>
>>> Guess so. I will have a look at this when I get some time again.
>> Ok so the problem is the clocksource_watchdog() function in
>> kernel/time/clocksource.c. This function is active if
>> CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the time sources of
>> each processor for being within bounds and then reschedule itself on the
>> next one.
>>
>> The purpose of the function seems to be to determine *if* a clocksource is
>> unstable. It does not mean that the clocksource *is* unstable.
>>
>> The critical piece of code is this:
>>
>>          /*
>>           * Cycle through CPUs to check if the CPUs stay synchronized
>>           * to each other.
>>           */
>>          next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
>>          if (next_cpu >= nr_cpu_ids)
>>                  next_cpu = cpumask_first(cpu_online_mask);
>>          watchdog_timer.expires += WATCHDOG_INTERVAL;
>>          add_timer_on(&watchdog_timer, next_cpu);
>>
>>
>> Should we just cycle through the cpus that are not isolated? Otherwise we
>> need to have some means to check the clocksources for accuracy remotely
>> (probably impossible for TSC etc).
>>
>> The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
>> second.
>>
>> Note that we are running with the patch that removes the 1 HZ minimum time
>> tick. With an older kernel code base (redhat) we can keep the kernel quiet
>> for minutes. The clocksource watchdog causes timers to fire again.
> I had similar issues, this seems to happen when the tsc is considered not reliable
> (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> flag).
>
> IIRC, this _has_ to execute on all online CPUs because the TSCs of all running
> CPUs are concerned.
>
> I personally override that by passing the tsc=reliable kernel parameter. Of course
> use it at your own risk.
>
> But eventually I don't think we can offload that to housekeeping-only CPUs.

Maybe the eventual model here is that as task-isolation cores
re-enter the kernel, they catch a hook that tells them to go
call the unreliable-tsc stuff and see what the state of it is.

This would be the same hook that we could use to defer
kernel TLB flushes, also.

The hard part is that on some platforms it may be fairly
intrusive to get all the hooks in.  Arm64 has a nice consistent
set of assembly routines to enter the kernel, which is how they
manage the context_tracking as well, but I fear that x86 may
have a lot more.
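
As a very rough sketch, with entirely made-up names (none of these
helpers exist today), that hook model might look like:

	/* Called on every kernel entry from a task-isolation cpu. */
	void task_isolation_kernel_entry(void)
	{
		if (!is_isolation_cpu(smp_processor_id()))
			return;
		if (clocksource_check_was_deferred())
			clocksource_verify_local();   /* the unreliable-tsc work */
		if (tlb_flush_was_deferred())
			flush_deferred_kernel_tlb();  /* the deferred TLB flush */
	}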

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v13 00/12] support "task_isolation" mode
  2016-07-22 12:50         ` Chris Metcalf
  (?)
  (?)
@ 2016-08-11  8:27         ` Peter Zijlstra
  -1 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2016-08-11  8:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Christoph Lameter, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, linux-doc,
	linux-api, linux-kernel

On Fri, Jul 22, 2016 at 08:50:44AM -0400, Chris Metcalf wrote:
> On 7/21/2016 10:20 PM, Christoph Lameter wrote:
> >On Thu, 21 Jul 2016, Chris Metcalf wrote:
> >>On 7/20/2016 10:04 PM, Christoph Lameter wrote:
> >>unstable, and then scheduling work to safely remove that timer.
> >>I haven't looked at this code before (in kernel/time/clocksource.c
> >>under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
> >>arm64 and tile aren't unstable.  Is it possible to boot your machine
> >>with a stable clocksource?
> >It already has a stable clocksource. Sorry but that was one of the criteria
> >for the server when we ordered them. Could this be clock adjustments?
> 
> We probably need to get clock folks to jump in on this thread!

Boot with tsc=reliable; this disables the watchdog.

We (sadly) have to have this thing running on most x86 because TSC, even
if initially stable, can do weird things once it's running.

We have seen:

 - SMI
 - hotplug
 - suspend
 - multi-socket

mess up the TSC, even if it was deemed 'good' at boot time.

If you _know_ your TSC to be solid, boot with tsc=reliable and be happy.
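
For example (cpu list purely illustrative), such a machine might boot
with something like:

	nohz_full=1-7 rcu_nocbs=1-7 tsc=reliable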

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-10 22:16             ` Frederic Weisbecker
  2016-08-10 22:26                 ` Chris Metcalf
@ 2016-08-11  8:40               ` Peter Zijlstra
  2016-08-11 11:58                 ` Frederic Weisbecker
  2016-08-11 16:00                 ` Paul E. McKenney
  1 sibling, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2016-08-11  8:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> I had similar issues, this seems to happen when the tsc is considered not reliable
> (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> flag).

Right, as per the other email, in general we cannot know/assume the TSC
to be working as intended :/

> IIRC, this _has_ to execute on all online CPUs because the TSCs of all running
> CPUs are concerned.

With modern Intel we could run it on one CPU per package I think, but at
the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
it doesn't make sense to me to keep the watchdog running; when it
triggers it would also have to kill all NOHZ_FULL stuff, which would
probably bring the entire machine down.

Arguably we should issue a boot time warning if NOHZ_FULL is configured
and the TSC watchdog is running.

> I personally override that by passing the tsc=reliable kernel
> parameter. Of course use it at your own risk.

Yes, that is (sadly) our only option. Manually assert our hardware is
solid under the intended workload and then manually disable the
watchdog.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11  8:40               ` Peter Zijlstra
@ 2016-08-11 11:58                 ` Frederic Weisbecker
  2016-08-15 15:03                     ` Chris Metcalf
  2016-08-11 16:00                 ` Paul E. McKenney
  1 sibling, 1 reply; 72+ messages in thread
From: Frederic Weisbecker @ 2016-08-11 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Chris Metcalf, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 10:40:02AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> > I had similar issues, this seems to happen when the tsc is considered not reliable
> > (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> > flag).
> 
> Right, as per the other email, in general we cannot know/assume the TSC
> to be working as intended :/

Yeah, I remember you explained me that a little while ago.

> 
> > IIRC, this _has_ to execute on all online CPUs because the TSCs of all running
> > CPUs are concerned.
> 
> With modern Intel we could run it on one CPU per package I think, but at
> the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> it doesn't make sense to me to keep the watchdog running; when it
> triggers it would also have to kill all NOHZ_FULL stuff, which would
> probably bring the entire machine down.
> 
> Arguably we should issue a boot time warning if NOHZ_FULL is configured
> and the TSC watchdog is running.

That's a very good idea! We do that when tsc is unstable but indeed we can't
seriously run NOHZ_FULL on a non-reliable tsc.

I'll take care of that warning.
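
(Roughly along these lines -- a sketch only, with the exact condition
and placement still to be worked out, and tsc_clocksource_reliable
being the x86-specific flag:)

	if (tick_nohz_full_running && !tsc_clocksource_reliable)
		pr_warn("NO_HZ_FULL: TSC not reliable; the clocksource watchdog will keep interrupting isolated cpus (consider tsc=reliable)\n");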

> 
> > I personally override that by passing the tsc=reliable kernel
> > parameter. Of course use it at your own risk.
> 
> Yes, that is (sadly) our only option. Manually assert our hardware is
> solid under the intended workload and then manually disable the
> watchdog.

Right, I'll tell about that in the warning.

Thanks for those details!

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11  8:40               ` Peter Zijlstra
  2016-08-11 11:58                 ` Frederic Weisbecker
@ 2016-08-11 16:00                 ` Paul E. McKenney
  2016-08-11 23:02                   ` Christoph Lameter
  1 sibling, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2016-08-11 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Christoph Lameter, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 10:40:02AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 12:16:58AM +0200, Frederic Weisbecker wrote:
> > I had similar issues, this seems to happen when the tsc is considered not reliable
> > (which doesn't necessarily mean unstable. I think it has to do with some x86 CPU feature
> > flag).
> 
> Right, as per the other email, in general we cannot know/assume the TSC
> to be working as intended :/
> 
> > IIRC, this _has_ to execute on all online CPUs because the TSCs of all running
> > CPUs are concerned.
> 
> With modern Intel we could run it on one CPU per package I think, but at
> the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> it doesn't make sense to me to keep the watchdog running; when it
> triggers it would also have to kill all NOHZ_FULL stuff, which would
> probably bring the entire machine down.

Well, you -could- force a very low priority CPU-bound task to run on
all nohz_full CPUs.  Not necessarily a good idea, but a relatively
non-intrusive response to that particular error condition.
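
(A sketch of that fallback, glossing over error handling and teardown;
the tick_forcer thread and its hookup here are hypothetical:)

	/* Keep each nohz_full cpu busy so its tick stays enabled. */
	static int tick_forcer(void *unused)
	{
		while (!kthread_should_stop())
			cpu_relax();
		return 0;
	}

	/* On watchdog failure: one SCHED_IDLE spinner per nohz_full cpu. */
	int cpu;
	struct sched_param param = { .sched_priority = 0 };
	for_each_cpu(cpu, tick_nohz_full_mask) {
		struct task_struct *t = kthread_create_on_cpu(tick_forcer,
						NULL, cpu, "tick_forcer/%u");
		if (!IS_ERR(t)) {
			sched_setscheduler_nocheck(t, SCHED_IDLE, &param);
			wake_up_process(t);
		}
	}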

							Thanx, Paul

> Arguably we should issue a boot time warning if NOHZ_FULL is configured
> and the TSC watchdog is running.
> 
> > I personally override that by passing the tsc=reliable kernel
> > parameter. Of course use it at your own risk.
> 
> Yes, that is (sadly) our only option. Manually assert our hardware is
> solid under the intended workload and then manually disabling the
> watchdog.
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 16:00                 ` Paul E. McKenney
@ 2016-08-11 23:02                   ` Christoph Lameter
  2016-08-11 23:47                     ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-08-11 23:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, 11 Aug 2016, Paul E. McKenney wrote:

> > With modern Intel we could run it on one CPU per package I think, but at
> > the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> > it doesn't make sense to me to keep the watchdog running; when it
> > triggers it would also have to kill all NOHZ_FULL stuff, which would
> > probably bring the entire machine down.
>
> Well, you -could- force a very low priority CPU-bound task to run on
> all nohz_full CPUs.  Not necessarily a good idea, but a relatively
> non-intrusive response to that particular error condition.

Given that we want the cpu only to run the user task, I would think that is
not a good idea.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 23:02                   ` Christoph Lameter
@ 2016-08-11 23:47                     ` Paul E. McKenney
  2016-08-12 14:23                       ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2016-08-11 23:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, Aug 11, 2016 at 06:02:34PM -0500, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> 
> > > With modern Intel we could run it on one CPU per package I think, but at
> > > the same time, too much in NOHZ_FULL assumes the TSC is indeed sane so
> > > it doesn't make sense to me to keep the watchdog running; when it
> > > triggers it would also have to kill all NOHZ_FULL stuff, which would
> > > probably bring the entire machine down.
> >
> > Well, you -could- force a very low priority CPU-bound task to run on
> > all nohz_full CPUs.  Not necessarily a good idea, but a relatively
> > non-intrusive response to that particular error condition.
> 
> Given that we want the cpu only to run the user task I would think that is
> not a good idea.

Heh!  The only really good idea is for clocks to be reliably in sync.

But if they go out of sync, what do you want to do instead?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 23:47                     ` Paul E. McKenney
@ 2016-08-12 14:23                       ` Christoph Lameter
  2016-08-12 14:26                           ` Frederic Weisbecker
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2016-08-12 14:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Frederic Weisbecker, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Thu, 11 Aug 2016, Paul E. McKenney wrote:

> Heh!  The only really good idea is for clocks to be reliably in sync.
>
> But if they go out of sync, what do you want to do instead?

For a NOHZ task? Write a message to the syslog and reenable tick.
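
(Illustratively, from the watchdog context where "cs" is the failing
clocksource, and assuming a kick-all style helper such as
tick_nohz_full_kick_all() to poke the isolated cpus:)

	pr_warn("clocksource watchdog: %s unstable, restoring tick on nohz_full cpus\n",
		cs->name);
	tick_nohz_full_kick_all();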

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-08-12 14:26                           ` Frederic Weisbecker
  0 siblings, 0 replies; 72+ messages in thread
From: Frederic Weisbecker @ 2016-08-12 14:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Peter Zijlstra, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> 
> > Heh!  The only really good idea is for clocks to be reliably in sync.
> >
> > But if they go out of sync, what do you want to do instead?
> 
> For a NOHZ task? Write a message to the syslog and reenable tick.

Indeed, a strong clocksource is a requirement for a full tickless machine.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-12 14:26                           ` Frederic Weisbecker
  (?)
@ 2016-08-12 16:19                           ` Paul E. McKenney
  2016-08-13 15:39                               ` Frederic Weisbecker
  -1 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2016-08-12 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Peter Zijlstra, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Fri, Aug 12, 2016 at 04:26:13PM +0200, Frederic Weisbecker wrote:
> On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> > On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> > 
> > > Heh!  The only really good idea is for clocks to be reliably in sync.
> > >
> > > But if they go out of sync, what do you want to do instead?
> > 
> > For a NOHZ task? Write a message to the syslog and reenable tick.

Fair enough!  Kicking off a low-priority task would achieve the latter
but not necessarily the former.  And it of course assumes that the worker
thread is at real-time priority with various scheduler anti-starvation
features disabled.

> Indeed, a strong clocksource is a requirement for a full tickless machine.

No disagreement here!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
@ 2016-08-13 15:39                               ` Frederic Weisbecker
  0 siblings, 0 replies; 72+ messages in thread
From: Frederic Weisbecker @ 2016-08-13 15:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, Peter Zijlstra, Chris Metcalf,
	Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Fri, Aug 12, 2016 at 09:19:19AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2016 at 04:26:13PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> > > On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> > > 
> > > > Heh!  The only really good idea is for clocks to be reliably in sync.
> > > >
> > > > But if they go out of sync, what do you want to do instead?
> > > 
> > > For a NOHZ task? Write a message to the syslog and reenable tick.
> 
> Fair enough!  Kicking off a low-priority task would achieve the latter
> but not necessarily the former.  And of course assumes that the worker
> thread is at real-time priority with various scheduler anti-starvation
> features disabled.
> 
> > Indeed, a strong clocksource is a requirement for a full tickless machine.
> 
> No disagreement here!  ;-)

I have a bot in my mind that randomly posts obvious statements about nohz_full
now and then :-)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)
  2016-08-11 11:58                 ` Frederic Weisbecker
@ 2016-08-15 15:03                     ` Chris Metcalf
  0 siblings, 0 replies; 72+ messages in thread
From: Chris Metcalf @ 2016-08-15 15:03 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, Christoph Lameter
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On 8/11/2016 7:58 AM, Frederic Weisbecker wrote:
>> Arguably we should issue a boot time warning if NOHZ_FULL is configured
>> and the TSC watchdog is running.
> That's a very good idea! We do that when tsc is unstable but indeed we can't
> seriously run NOHZ_FULL on a non-reliable tsc.
>
> I'll take care of that warning.

Thanks.  So I will drop Christoph's patch to run the TSC watchdog on just
housekeeping cores and we will rely on the "boot time warning" instead.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2016-08-15 15:04 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-14 20:48 [PATCH v13 00/12] support "task_isolation" mode Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 01/12] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 02/12] vmstat: add vmstat_idle function Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 03/12] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 04/12] task_isolation: add initial support Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 05/12] task_isolation: track asynchronous interrupts Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 06/12] arch/x86: enable task isolation functionality Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 07/12] arm64: factor work_pending state machine to C Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 08/12] arch/arm64: enable task isolation functionality Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 09/12] arch/tile: " Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 10/12] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 11/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-07-14 20:48 ` [PATCH v13 12/12] task_isolation: add user-settable notification signal Chris Metcalf
2016-07-14 21:03 ` [PATCH v13 00/12] support "task_isolation" mode Andy Lutomirski
2016-07-14 21:22   ` Chris Metcalf
2016-07-18 22:11     ` Andy Lutomirski
2016-07-18 22:50       ` Chris Metcalf
2016-07-18  0:42   ` Christoph Lameter
2016-07-21  2:04 ` Christoph Lameter
2016-07-21 14:06   ` Chris Metcalf
2016-07-22  2:20     ` Christoph Lameter
2016-07-22 12:50       ` Chris Metcalf
2016-07-25 16:35         ` Christoph Lameter
2016-07-27 13:55           ` clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode) Christoph Lameter
2016-07-27 14:12             ` Chris Metcalf
2016-07-27 15:23               ` Christoph Lameter
2016-07-27 15:31                 ` Christoph Lameter
2016-07-27 17:06                   ` Chris Metcalf
2016-07-27 18:56                     ` Christoph Lameter
2016-07-27 19:49                       ` Chris Metcalf
2016-07-27 19:53                         ` Christoph Lameter
2016-07-27 19:58                           ` Chris Metcalf
2016-07-29 18:31                             ` Francis Giraldeau
2016-07-29 21:04                               ` Chris Metcalf
2016-08-10 22:16             ` Frederic Weisbecker
2016-08-10 22:26               ` Chris Metcalf
2016-08-11  8:40               ` Peter Zijlstra
2016-08-11 11:58                 ` Frederic Weisbecker
2016-08-15 15:03                   ` Chris Metcalf
2016-08-11 16:00                 ` Paul E. McKenney
2016-08-11 23:02                   ` Christoph Lameter
2016-08-11 23:47                     ` Paul E. McKenney
2016-08-12 14:23                       ` Christoph Lameter
2016-08-12 14:26                         ` Frederic Weisbecker
2016-08-12 16:19                           ` Paul E. McKenney
2016-08-13 15:39                             ` Frederic Weisbecker
2016-08-11  8:27         ` [PATCH v13 00/12] support "task_isolation" mode Peter Zijlstra
2016-07-27 14:01 ` Christoph Lameter
