linux-kernel.vger.kernel.org archive mirror
* [PATCH v15 00/13] support "task_isolation" mode
@ 2016-08-16 21:19 Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
                   ` (15 more replies)
  0 siblings, 16 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Francis Giraldeau, linux-doc, linux-api,
	linux-kernel
  Cc: Chris Metcalf

Here is a respin of the task-isolation patch set.

Again, I have been getting email asking me when and where this patch
will be upstreamed so folks can start using it.  I had been thinking
the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?

Changes since v14:

- Rebased on v4.8-rc2 (so incorporates my NOHZ bugfix vs v4.8-rc1)

- Dropped Christoph Lameter's patch to avoid scheduling the
  clocksource watchdog on nohz cores; the recommendation is to just
  boot with tsc=reliable for NOHZ in any case, if necessary.

- Optimize task_isolation_enter() by checking vmstat_idle() before
  calling quiet_vmstat_sync() [Frederic, Christoph]

- Correct buggy x86 syscall_trace_enter() support [Andy]

- Add _TIF_TASK_ISOLATION to x86 _TIF_ALLWORK_MASK; not technically
  necessary but good for self-documentation [Andy]

- Improve comment for task_isolation_syscall() callsites to clarify
  that we are delivering a signal if we bail out of the syscall [Andy]

- Ran the selftest through checkpatch and cleaned up style issues

The previous (v14) patch series is here:

https://lkml.kernel.org/r/1470774596-17341-1-git-send-email-cmetcalf@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86 (plus some volunteer testing on x86 by
Christoph and Francis).

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal
  task_isolation self test

 Documentation/kernel-parameters.txt                |  16 +
 arch/arm64/Kconfig                                 |   1 +
 arch/arm64/include/asm/thread_info.h               |   5 +-
 arch/arm64/kernel/entry.S                          |  12 +-
 arch/arm64/kernel/ptrace.c                         |  18 +-
 arch/arm64/kernel/signal.c                         |  42 +-
 arch/arm64/kernel/smp.c                            |   2 +
 arch/arm64/mm/fault.c                              |   8 +-
 arch/tile/Kconfig                                  |   1 +
 arch/tile/include/asm/thread_info.h                |   4 +-
 arch/tile/kernel/process.c                         |   9 +
 arch/tile/kernel/ptrace.c                          |  10 +
 arch/tile/kernel/single_step.c                     |   7 +
 arch/tile/kernel/smp.c                             |  26 +-
 arch/tile/kernel/time.c                            |   1 +
 arch/tile/kernel/unaligned.c                       |   4 +
 arch/tile/mm/fault.c                               |  13 +-
 arch/tile/mm/homecache.c                           |   2 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |  21 +-
 arch/x86/include/asm/thread_info.h                 |   4 +-
 arch/x86/kernel/smp.c                              |   2 +
 arch/x86/kernel/traps.c                            |   3 +
 arch/x86/mm/fault.c                                |   5 +
 drivers/base/cpu.c                                 |  18 +
 drivers/clocksource/arm_arch_timer.c               |   2 +
 include/linux/context_tracking_state.h             |   6 +
 include/linux/isolation.h                          |  73 +++
 include/linux/sched.h                              |   3 +
 include/linux/swap.h                               |   1 +
 include/linux/tick.h                               |   2 +
 include/linux/vmstat.h                             |   4 +
 include/uapi/linux/prctl.h                         |  10 +
 init/Kconfig                                       |  37 ++
 kernel/Makefile                                    |   1 +
 kernel/fork.c                                      |   3 +
 kernel/irq_work.c                                  |   5 +-
 kernel/isolation.c                                 | 338 +++++++++++
 kernel/sched/core.c                                |  14 +
 kernel/signal.c                                    |  15 +
 kernel/smp.c                                       |   6 +-
 kernel/softirq.c                                   |  33 ++
 kernel/sys.c                                       |   9 +
 kernel/time/tick-sched.c                           |  36 +-
 mm/swap.c                                          |  15 +-
 mm/vmstat.c                                        |  19 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |  11 +
 tools/testing/selftests/task_isolation/config      |   2 +
 tools/testing/selftests/task_isolation/isolation.c | 646 +++++++++++++++++++++
 50 files changed, 1470 insertions(+), 57 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-kernel
  Cc: Chris Metcalf

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 613771909b6e..fab62aa74079 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -234,6 +234,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -336,6 +337,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89cec42d19ff..57fc29750da6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1754,6 +1754,15 @@ void quiet_vmstat(void)
 }
 
 /*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v15 02/13] vmstat: add vmstat_idle function
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running on the
core and that the vmstat diffs do not require an update.  It is
called from the task-isolation code to decide whether any work is
actually needed to quiet vmstat.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fab62aa74079..69b6cc4be909 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -235,6 +235,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -338,6 +339,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57fc29750da6..7dd17c06d3a7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1763,6 +1763,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was being done in the loop in lru_add_drain_all(),
but having it be callable for a particular cpu is helpful for the
task-isolation patches.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..58966a235298 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 75c63bb2a1da..a2be6f0931b5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -655,6 +655,15 @@ void deactivate_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -699,11 +708,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.7.2


* [PATCH v15 04/13] task_isolation: add initial support
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (2 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-29 16:33   ` Peter Zijlstra
  2017-02-02 16:13   ` Eugene Syromiatnikov
  2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of pending TIF work, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  It takes
any actions that might avoid a future interrupt to the core, such as
quiescing a worker thread currently scheduled on the core (e.g. the
vmstat worker) or flushing state now that would otherwise trigger a
future IPI to the core (e.g. the mm lru per-cpu cache).  In addition,
it requests rescheduling if the scheduler dyntick is still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 218 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 385 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a49186..7f1336b50dcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y,
+			specify the list of CPUs on which tasks will be
+			able to use prctl(PR_SET_TASK_ISOLATION) to set
+			up task isolation mode.  Setting this boot flag
+			implicitly also sets up nohz_full and isolcpus
+			mode for the listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline int task_isolation_init(void) { return 0; }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..77dc12cd4fe8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1923,6 +1923,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f096050d..a95a35a31b46 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -786,6 +786,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 userspace tasks that are running hard real-time tasks in
+	 userspace, such as a 10 Gbit network driver in userspace.
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d4a866..54542266d7a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1533,6 +1534,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..4382e2043de9
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,218 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	if (!vmstat_idle())
+		quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in); for most syscalls it prevents execution and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc00d08..895f547ff66f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 204fdc86863d..a6e29527743e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v15 05/13] task_isolation: track asynchronous interrupts
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (3 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump a stack backtrace to the console.

We also support a new "task_isolation_debug" flag which forces
the stack backtrace to be dumped to the console regardless.  We try to catch
the original source of the interrupt, e.g. if an IPI is dispatched
to a task-isolation task, we dump the backtrace of the remote
core that is sending the IPI, rather than just dumping out a
trace showing only that the target core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt    |  8 ++++
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h              | 13 ++++++
 kernel/irq_work.c                      |  5 ++-
 kernel/isolation.c                     | 74 ++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                    | 14 +++++++
 kernel/signal.c                        |  7 ++++
 kernel/smp.c                           |  6 ++-
 kernel/softirq.c                       | 33 +++++++++++++++
 9 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7f1336b50dcc..f172cd310cf4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3951,6 +3951,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
 	return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+	return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
 			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
 	} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)					\
+	do {								\
+		if (task_isolation_possible(cpu))			\
+			_task_isolation_debug(cpu, type);		\
+	} while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+					 const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+				      const char *type);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -55,6 +66,8 @@ extern inline void task_isolation_set_flags(struct task_struct *p,
 					    unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu, const char *type) { }
+#define task_isolation_debug_cpumask(mask, type) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..15f3d44acf11 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 4382e2043de9..be7e95192e76 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include <asm/syscall.h>
 #include "time/tick-sched.h"
@@ -216,3 +217,76 @@ int task_isolation_syscall(int syscall)
 					 -ERESTARTNOINTR, -1);
 	return -1;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static int task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * Ensure the task is actually in userspace; if it is in kernel
+	 * mode, it is expected that it may receive interrupts, and in
+	 * any case they don't affect the isolation.  Note that there
+	 * is a race condition here as a task may have committed
+	 * to returning to user space but not yet set the context
+	 * tracking state to reflect it, and the check here is before
+	 * we trigger the interrupt, so we might fail to warn about a
+	 * legitimate interrupt.  However, the race window is narrow
+	 * and hitting it does not cause any incorrect behavior other
+	 * than failing to send the warning.
+	 */
+	if (cpu != smp_processor_id() && !context_tracking_cpu_in_user(cpu))
+		return;
+
+	/*
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (in_nmi())
+		force_debug = true;
+	else
+		task_isolation_deliver_signal(p, type);
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("cpu %d: %s violating task isolation for %s/%d on cpu %d\n",
+		       smp_processor_id(), type, p->comm, p->pid, cpu);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask, const char *type)
+{
+	int cpu, thiscpu = get_cpu();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu_and(cpu, mask, task_isolation_map)
+		if (cpu != thiscpu)
+			_task_isolation_debug(cpu, type);
+
+	put_cpu();
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a906f20fba7..ef2e6de37cd4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -75,6 +75,7 @@
 #include <linux/compiler.h>
 #include <linux/frame.h>
 #include <linux/prefetch.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -664,6 +665,19 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void _task_isolation_debug(int cpu, const char *type)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task = try_get_task_struct(&rq->curr);
+
+	if (task) {
+		task_isolation_debug_task(cpu, task, type);
+		put_task_struct(task);
+	}
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index 895f547ff66f..40356a06b761 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -639,6 +639,13 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/*
+	 * We're delivering a signal anyway, so no need for more
+	 * warnings.  This also avoids self-deadlock since an IPI to
+	 * kick the task would otherwise generate another signal.
+	 */
+	task_isolation_set_flags(t, 0);
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index 3aa642d39c03..35ca174db581 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -162,8 +163,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -441,6 +444,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..2f1065795318 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id(), "irq");
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread
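[Editorial note: tying the pieces together, the debug flag documented in this patch combines with the existing boot parameter from earlier in the series. A plausible kernel command-line fragment (cpu numbers purely illustrative) would be:]

```
task_isolation=1-3 task_isolation_debug
```

[With task_isolation_debug set, any interrupt about to disturb an isolated task on cpus 1-3 additionally dumps a ratelimited backtrace of the interrupting core, as implemented in task_isolation_debug_task() above.]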

* [PATCH v15 06/13] arch/x86: enable task isolation functionality
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (4 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-30 21:46   ` Andy Lutomirski
  2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In exit_to_usermode_loop(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
and after we've handled the other work, call task_isolation_enter()
for such tasks.

In syscall_trace_enter(), we add the necessary support for
reporting syscalls for task-isolation processes.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Tested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 21 ++++++++++++++++++++-
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/smp.c              |  2 ++
 arch/x86/kernel/traps.c            |  3 +++
 arch/x86/mm/fault.c                |  5 +++++
 6 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c33562..7f6ec46d18d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -90,6 +90,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_SOFT_DIRTY		if X86_64
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..3b23b3542909 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	if (emulated)
 		return -1L;
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -136,7 +147,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_TASK_ISOLATION)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
@@ -170,11 +181,19 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((cached_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			cached_flags &= ~_TIF_TASK_ISOLATION;
+
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8b7c8d8e0852..7255367fd499 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_TASK_ISOLATION	13	/* task isolation enabled for task */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -117,6 +118,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -142,7 +144,7 @@ struct thread_info {
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
 	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
-	_TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777cf3851..e4ffd9581cdb 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/isolation.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -125,6 +126,7 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
+	task_isolation_debug(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b70ca12dd389..eae51685c2b3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -383,6 +384,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		/* No signal was generated, but notify task-isolation tasks. */
+		task_isolation_quiet_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dc8023060456..b1509876794c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_quiet_exception */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1397,6 +1398,10 @@ good_area:
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v15 07/13] arm64: factor work_pending state machine to C
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (5 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-17  8:05   ` Will Deacon
  2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting the code to C in the process,
making the code more legible.

This patch tries to closely mirror the existing behaviour while using
the usual C control flow primitives. As local_irq_{disable,enable} may
be instrumented, we balance exception entry (where we will almost
certainly enable IRQs) with a call to trace_hardirqs_on just before the
return to userspace.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/kernel/entry.S  | 12 ++++--------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 441420ca7d08..6a64182822e5 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -707,18 +707,13 @@ ret_fast_syscall_trace:
  * Ok, we need to do extra processing, enter the slow path.
  */
 work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
 	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bl	trace_hardirqs_off		// the IRQs are off here, inform the tracing code
+	bl	trace_hardirqs_on		// enabled while in userspace
 #endif
-	bl	schedule
-
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
+	b	finish_ret_to_user
 /*
  * "slow" syscall return path.
  */
@@ -727,6 +722,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+finish_ret_to_user:
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8eafdbc7cb8..404dd67080b9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -402,15 +402,31 @@ static void do_signal(struct pt_regs *regs)
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
-
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
-
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+	/*
+	 * The assembly code enters us with IRQs off, but it hasn't
+	 * informed the tracing code of that for efficiency reasons.
+	 * Update the trace code with the current status.
+	 */
+	trace_hardirqs_off();
+	do {
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+		} else {
+			local_irq_enable();
+
+			if (thread_flags & _TIF_SIGPENDING)
+				do_signal(regs);
+
+			if (thread_flags & _TIF_NOTIFY_RESUME) {
+				clear_thread_flag(TIF_NOTIFY_RESUME);
+				tracehook_notify_resume(regs);
+			}
+
+			if (thread_flags & _TIF_FOREIGN_FPSTATE)
+				fpsimd_restore_current_state();
+		}
 
+		local_irq_disable();
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+	} while (thread_flags & _TIF_WORK_MASK);
 }
-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v15 08/13] arch/arm64: enable task isolation functionality
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (6 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-26 16:25   ` Catalin Marinas
  2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Mark Rutland, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

In do_notify_resume(), call task_isolation_ready() for
TIF_TASK_ISOLATION tasks when we are checking the thread-info flags;
and after we've handled the other work, call task_isolation_enter()
for such tasks.  To ensure we always call task_isolation_enter() when
returning to userspace, add _TIF_TASK_ISOLATION to _TIF_WORK_MASK;
once task_isolation_ready() reports that no further work is needed,
we clear the flag from the cached thread-info flags so the work
loop can terminate.

We tweak syscall_trace_enter() slightly to read the thread-info
flags once up front and test the cached value, rather than doing a
volatile read from memory for each test.  This avoids a small
overhead for each test, and in particular avoids that overhead for
TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 18 +++++++++++++++---
 arch/arm64/kernel/signal.c           | 10 ++++++++++
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  8 +++++++-
 6 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index bc3f00f586f1..5cacf1de28ae 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -62,6 +62,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARM_SMCCC
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index abd64bd1f6d9..bdc6426b9968 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -109,6 +109,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
+#define TIF_TASK_ISOLATION	4
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9
@@ -124,6 +125,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -132,7 +134,8 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
+				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index e0c81da60f76..9f093fcf97a3 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1347,14 +1348,25 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	/* Do the secure computing after ptrace; failures should be fast. */
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing(NULL) == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 404dd67080b9..f9b9b25636ca 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -424,9 +425,18 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
 				fpsimd_restore_current_state();
+
+			if (thread_flags & _TIF_TASK_ISOLATION)
+				task_isolation_enter();
 		}
 
 		local_irq_disable();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
+
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_flags & _TIF_WORK_MASK);
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d93d43352504..08b0f3754e85 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -874,6 +875,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_debug(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 05d2bd776c69..784817478535 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -392,8 +393,13 @@ retry:
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
-			      VM_FAULT_BADACCESS))))
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_quiet_exception("page fault at %#lx",
+						       addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we have no context to
-- 
2.7.2


* [PATCH v15 09/13] arch/tile: enable task isolation functionality
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (7 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_quiet_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/Kconfig                   |  1 +
 arch/tile/include/asm/thread_info.h |  4 +++-
 arch/tile/kernel/process.c          |  9 +++++++++
 arch/tile/kernel/ptrace.c           | 10 ++++++++++
 arch/tile/kernel/single_step.c      |  7 +++++++
 arch/tile/kernel/smp.c              | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c        |  4 ++++
 arch/tile/mm/fault.c                | 13 ++++++++++++-
 arch/tile/mm/homecache.c            |  2 ++
 9 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4820a02838ac..937cfe4cbb5b 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -18,6 +18,7 @@ config TILE
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..8fe17c7e872e 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_NOHZ		11	/* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION	12	/* in task isolation mode */
 
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
@@ -139,11 +140,12 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION	(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
 	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_TASK_ISOLATION)
 
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index a465d8372edd..bbe1d29b242f 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -496,9 +497,17 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		if (thread_info_flags & _TIF_TASK_ISOLATION)
+			task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
+		/* Clear task isolation from cached_flags manually. */
+		if ((thread_info_flags & _TIF_TASK_ISOLATION) &&
+		    task_isolation_ready())
+			thread_info_flags &= ~_TIF_TASK_ISOLATION;
+
 	} while (thread_info_flags & _TIF_WORK_MASK);
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index d89b7011667c..a92e334bb562 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -261,6 +262,15 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 		return -1;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+			return -1;
+	}
+
 	if (secure_computing(NULL) == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..b48da9860b80 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,9 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +771,9 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..d610322026d0 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(task_isolation_map, "icache flush");
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu, "reschedule IPI");
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 9772a3554282..0335f7cd81f4 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,9 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_quiet_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index beba986589e5..9bd3600092dc 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -308,8 +309,13 @@ static int handle_page_fault(struct pt_regs *regs,
 	 */
 	pgd = get_current_pgd();
 	if (handle_migrating_pte(pgd, fault_num, address, regs->pc,
-				 is_kernel_mode, write))
+				 is_kernel_mode, write)) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (!is_kernel_mode)
+			task_isolation_quiet_exception("migration fault at %#lx",
+						       address);
 		return 1;
+	}
 
 	si_code = SEGV_MAPERR;
 
@@ -479,6 +485,11 @@ good_area:
 #endif
 
 	up_read(&mm->mmap_sem);
+
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_quiet_exception("page fault at %#lx", address);
+
 	return 1;
 
 /*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..2fe368599df6 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask, "remote cache/TLB flush");
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.7.2


* [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (8 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the scheduling tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 57700541f951..1fe9c48f5f51 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -317,6 +317,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.7.2


* [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (9 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 init/Kconfig       | 10 ++++++++++
 kernel/isolation.c |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index a95a35a31b46..a9b9c7635de2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,16 @@ config TASK_ISOLATION
 	 You should say "N" unless you are intending to run a
 	 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+	bool "Provide task isolation on all CPUs by default (except CPU 0)"
+	depends on TASK_ISOLATION
+	help
+	 If the user doesn't pass the task_isolation boot option to
+	 define the range of task isolation CPUs, consider that all
+	 CPUs in the system are task isolation by default.
+	 Note the boot CPU will still be kept outside the range to
+	 handle timekeeping duty, etc.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index be7e95192e76..3dbb01ac503f 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
 	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
 	if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+		alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		cpumask_copy(task_isolation_map, cpu_possible_mask);
+		cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
 		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
 		return 0;
+#endif
 	}
 
 	/*
-- 
2.7.2


* [PATCH v15 12/13] task_isolation: add user-settable notification signal
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (10 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/uapi/linux/prctl.h |  5 ++++
 kernel/isolation.c         | 62 ++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 3dbb01ac503f..ba643ad9d02b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
 	return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+	if (flags & PR_TASK_ISOLATION_USERSIG)
+		return PR_TASK_ISOLATION_GET_SIG(flags);
+	else
+		return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
 	if (flags != 0) {
+		int sig = task_isolation_sig(flags);
+
 		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
 		    !task_isolation_possible(raw_smp_processor_id())) {
 			/* Invalid task affinity setting. */
-			return -EINVAL;
+			if (sig)
+				return -EINVAL;
+			else
+				pr_warn("%s/%d: enabling non-signalling task isolation\n"
+					"and not bound to a single task isolation core\n",
+					current->comm, current->pid);
 		}
-		if (!can_stop_my_full_tick_now()) {
+
+		if (sig && !can_stop_my_full_tick_now()) {
 			/* System not yet ready for task isolation. */
 			return -EAGAIN;
 		}
@@ -161,11 +184,11 @@ void task_isolation_enter(void)
 }
 
 static void task_isolation_deliver_signal(struct task_struct *task,
-					  const char *buf)
+					  const char *buf, int sig)
 {
 	siginfo_t info = {};
 
-	info.si_signo = SIGKILL;
+	info.si_signo = sig;
 
 	/*
 	 * Report on the fact that isolation was violated for the task.
@@ -176,7 +199,10 @@ static void task_isolation_deliver_signal(struct task_struct *task,
 	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
 		task->comm, task->pid, buf);
 
-	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	/*
+	 * Turn off task isolation mode to avoid further isolation callbacks.
+	 * It can choose to re-enable task isolation mode in the signal handler.
+	 */
 	task_isolation_set_flags(task, 0);
 
 	send_sig_info(info.si_signo, &info, task);
@@ -191,15 +217,20 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 	struct task_struct *task = current;
 	va_list args;
 	char buf[100];
+	int sig;
 
 	/* RCU should have been enabled prior to this point. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
 
+	sig = task_isolation_sig(task->task_isolation_flags);
+	if (sig == 0)
+		return;
+
 	va_start(args, fmt);
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 
-	task_isolation_deliver_signal(task, buf);
+	task_isolation_deliver_signal(task, buf, sig);
 }
 
 /*
@@ -210,14 +241,19 @@ void _task_isolation_quiet_exception(const char *fmt, ...)
 int task_isolation_syscall(int syscall)
 {
 	char buf[20];
+	int sig;
 
 	if (syscall == __NR_prctl ||
 	    syscall == __NR_exit ||
 	    syscall == __NR_exit_group)
 		return 0;
 
+	sig = task_isolation_sig(current->task_isolation_flags);
+	if (sig == 0)
+		return 0;
+
 	snprintf(buf, sizeof(buf), "syscall %d", syscall);
-	task_isolation_deliver_signal(current, buf);
+	task_isolation_deliver_signal(current, buf, sig);
 
 	syscall_set_return_value(current, current_pt_regs(),
 					 -ERESTARTNOINTR, -1);
@@ -237,6 +273,7 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 {
 	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
 	bool force_debug = false;
+	int sig;
 
 	/*
 	 * Our caller made sure the task was running on a task isolation
@@ -267,10 +304,13 @@ void task_isolation_debug_task(int cpu, struct task_struct *p, const char *type)
 	 * and instead just treat it as if "debug" mode was enabled,
 	 * since that's pretty much all we can do.
 	 */
-	if (in_nmi())
-		force_debug = true;
-	else
-		task_isolation_deliver_signal(p, type);
+	sig = task_isolation_sig(p->task_isolation_flags);
+	if (sig != 0) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_deliver_signal(p, type, sig);
+	}
 
 	/*
 	 * If (for example) the timer interrupt starts ticking
-- 
2.7.2


* [PATCH v15 13/13] task_isolation self test
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (11 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Shuah Khan, Shuah Khan, linux-kselftest,
	linux-kernel
  Cc: Chris Metcalf

This code tests various aspects of task_isolation.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |  11 +
 tools/testing/selftests/task_isolation/config      |   2 +
 tools/testing/selftests/task_isolation/isolation.c | 646 +++++++++++++++++++++
 4 files changed, 660 insertions(+)
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff9e5f20a5a7..bd97479f44b3 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -23,6 +23,7 @@ TARGETS += sigaltstack
 TARGETS += size
 TARGETS += static_keys
 TARGETS += sysctl
+TARGETS += task_isolation
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/task_isolation/Makefile b/tools/testing/selftests/task_isolation/Makefile
new file mode 100644
index 000000000000..c8927994fe18
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/Makefile
@@ -0,0 +1,11 @@
+CFLAGS += -O2 -g -W -Wall
+LDFLAGS += -pthread
+
+TEST_PROGS := isolation
+
+all: $(TEST_PROGS)
+
+include ../lib.mk
+
+clean:
+	$(RM) $(TEST_PROGS)
diff --git a/tools/testing/selftests/task_isolation/config b/tools/testing/selftests/task_isolation/config
new file mode 100644
index 000000000000..49e18e43b737
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/config
@@ -0,0 +1,2 @@
+CONFIG_TASK_ISOLATION=y
+CONFIG_TASK_ISOLATION_ALL=y
diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
new file mode 100644
index 000000000000..d5f18231b94d
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -0,0 +1,646 @@
+/*
+ * This test program tests the features of task isolation.
+ *
+ * - Makes sure enabling task isolation fails if you are unaffinitized
+ *   or on a non-task-isolation cpu.
+ *
+ * - Tests that /sys/devices/system/cpu/task_isolation works correctly.
+ *
+ * - Validates that various synchronous exceptions are fatal in isolation
+ *   mode:
+ *
+ *   * Page fault
+ *   * System call
+ *   * TLB invalidation from another thread [1]
+ *   * Unaligned access [2]
+ *
+ * - Tests that taking a user-defined signal for the above faults works.
+ *
+ * - Tests that isolation in "no signal" mode works as expected: you can
+ *   perform multiple system calls without a signal, and if another
+ *   process bumps you, you return to userspace without any extra jitter.
+ *
+ * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64
+ * [2] Unaligned access only causes exceptions on some platforms, e.g. tile
+ *
+ *
+ * You must be running under a kernel configured with TASK_ISOLATION.
+ *
+ * You must either have configured with TASK_ISOLATION_ALL or else
+ * booted with an argument like "task_isolation=1-15" to enable some
+ * task-isolation cores.  If you get interrupts, you can also add
+ * the boot argument "task_isolation_debug" to learn more.
+ *
+ * NOTE: you must disable the code in tick_nohz_stop_sched_tick()
+ * that limits the tick delta to the maximum scheduler deferment
+ * by making it conditional not just on "!ts->inidle" but also
+ * on !test_thread_flag(TIF_TASK_ISOLATION).  This is around line 1292
+ * in kernel/time/tick-sched.c (as of kernel 4.7).
+ *
+ *
+ * To compile the test program, run "make".
+ *
+ * Run the program as "./isolation" and if you want to run the
+ * jitter-detection loop for longer than 10 giga-cycles, specify the
+ * number of giga-cycles to run it for as a command-line argument.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <string.h>
+#include <errno.h>
+#include <sched.h>
+#include <pthread.h>
+#include <sys/wait.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/prctl.h>
+#include "../kselftest.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
+
+#ifndef PR_SET_TASK_ISOLATION   /* Not in system headers yet? */
+# define PR_SET_TASK_ISOLATION		48
+# define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
+#endif
+
+/* The cpu we are using for isolation tests. */
+static int task_isolation_cpu;
+
+/* Overall status, maintained as tests run. */
+static int exit_status = KSFT_PASS;
+
+/* Set affinity to a single cpu or die if trying to do so fails. */
+void set_my_cpu(int cpu)
+{
+	cpu_set_t set;
+	int rc;
+
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	rc = sched_setaffinity(0, sizeof(cpu_set_t), &set);
+	assert(rc == 0);
+}
+
+/*
+ * Run a child process in task isolation mode and report its status.
+ * The child does mlockall() and moves itself to the task isolation cpu.
+ * It then runs SETUP_FUNC (if specified), calls
+ * prctl(PR_SET_TASK_ISOLATION, FLAGS) if FLAGS is non-zero, and
+ * finally invokes TEST_FUNC and exits with its status.
+ */
+static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
+{
+	int pid, rc, status;
+
+	fflush(stdout);
+	pid = fork();
+	assert(pid >= 0);
+	if (pid != 0) {
+		/* In parent; wait for child and return its status. */
+		waitpid(pid, &status, 0);
+		return status;
+	}
+
+	/* In child. */
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	set_my_cpu(task_isolation_cpu);
+	if (setup_func)
+		setup_func();
+	if (flags) {
+		do
+			rc = prctl(PR_SET_TASK_ISOLATION, flags);
+		while (rc != 0 && errno == EAGAIN);
+		if (rc != 0) {
+			printf("couldn't enable isolation (%d): FAIL\n", errno);
+			ksft_exit_fail();
+		}
+	}
+	rc = test_func();
+	exit(rc);
+}
+
+/*
+ * Run a test and ensure it is killed with SIGKILL by default,
+ * for whatever misdemeanor is committed in TEST_FUNC.
+ * Also test it with SIGUSR1 as well to make sure that works.
+ */
+static void test_killed(const char *testname, void (*setup_func)(),
+			int (*test_func)())
+{
+	int status;
+
+	status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+
+	status = run_test(setup_func, test_func,
+			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
+			  PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
+		printf("%s (SIGUSR1): OK\n", testname);
+	} else {
+		printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Run a test and make sure it exits with success. */
+static void test_ok(const char *testname, void (*setup_func)(),
+		    int (*test_func)())
+{
+	int status;
+
+	status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (status == KSFT_PASS) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Run a test with no signals and make sure it exits with success. */
+static void test_nosig(const char *testname, void (*setup_func)(),
+		       int (*test_func)())
+{
+	int status;
+
+	status = run_test(setup_func, test_func,
+			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
+	if (status == KSFT_PASS) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Mapping address passed from setup function to test function. */
+static char *fault_file_mapping;
+
+/* mmap() a file so we can test touching a not-yet-faulted-in page. */
+static void setup_fault(void)
+{
+	char fault_file[] = "/tmp/isolation_XXXXXX";
+	int fd, rc;
+
+	fd = mkstemp(fault_file);
+	assert(fd >= 0);
+	rc = ftruncate(fd, getpagesize());
+	assert(rc == 0);
+	fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+				  MAP_SHARED, fd, 0);
+	assert(fault_file_mapping != MAP_FAILED);
+	close(fd);
+	unlink(fault_file);
+}
+
+/* Now touch the not-yet-faulted-in page (and be killed). */
+static int do_fault(void)
+{
+	*fault_file_mapping = 1;
+	return KSFT_FAIL;
+}
+
+/* Make a syscall (and be killed). */
+static int do_syscall(void)
+{
+	write(STDOUT_FILENO, "goodbye, world\n", 15);
+	return KSFT_FAIL;
+}
+
+/* Turn isolation back off and don't be killed. */
+static int do_syscall_off(void)
+{
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	write(STDOUT_FILENO, "==> hello, world\n", 17);
+	return KSFT_PASS;
+}
+
+/* If we're not getting a signal, make sure we can do multiple system calls. */
+static int do_syscall_multi(void)
+{
+	write(STDOUT_FILENO, "==> hello, world 1\n", 19);
+	write(STDOUT_FILENO, "==> hello, world 2\n", 19);
+	return KSFT_PASS;
+}
+
+#ifdef __aarch64__
+/* ARM64 uses tlbi instructions, so it doesn't need to interrupt the remote core. */
+static void test_munmap(void) {}
+#else
+
+/*
+ * Fork a thread that will munmap() after a short while.
+ * It will deliver a TLB flush to the task isolation core.
+ */
+
+static void *start_munmap(void *p)
+{
+	usleep(500000);   /* 0.5s */
+	munmap(p, getpagesize());
+	return 0;
+}
+
+static void setup_munmap(void)
+{
+	pthread_t thr;
+	void *p;
+	int rc;
+
+	/* First, go back to cpu 0 and allocate some memory. */
+	set_my_cpu(0);
+	p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		 MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, -1, 0);
+	assert(p != MAP_FAILED);
+
+	/*
+	 * Now fire up a thread that will wait half a second on cpu 0
+	 * and then munmap the mapping.
+	 */
+	rc = pthread_create(&thr, NULL, start_munmap, p);
+	assert(rc == 0);
+
+	/* Back to the task-isolation cpu. */
+	set_my_cpu(task_isolation_cpu);
+}
+
+/* Global variable to avoid the compiler outsmarting us. */
+int munmap_spin;
+
+static int do_munmap(void)
+{
+	while (munmap_spin < 1000000000)
+		WRITE_ONCE(munmap_spin, munmap_spin + 1);
+	return KSFT_FAIL;
+}
+
+static void test_munmap(void)
+{
+	test_killed("test_munmap", setup_munmap, do_munmap);
+}
+#endif
+
+#ifdef __tilegx__
+/*
+ * Make an unaligned access (and be killed).
+ * Only for tilegx, since other platforms don't do in-kernel fixups.
+ */
+static int
+do_unaligned(void)
+{
+	static int buf[2];
+	int *addr = (int *)((char *)buf + 1);
+
+	READ_ONCE(*addr);
+
+	asm("nop");
+	return KSFT_FAIL;
+}
+
+static void test_unaligned(void)
+{
+	test_killed("test_unaligned", NULL, do_unaligned);
+}
+#else
+static void test_unaligned(void) {}
+#endif
+
+/*
+ * Fork a process that will spin annoyingly on the same core
+ * for half a second.  Since prctl() won't succeed if another task
+ * is actively running on our cpu, we follow this handshake sequence:
+ *
+ * 1. Child (in setup_quiesce, here) starts up, lets the parent know
+ *    it is running via *childstate, and spins waiting for the parent
+ *    to set *statep.
+ * 2. Parent (in do_quiesce, below) enters isolation mode via prctl(),
+ *    sets *statep to 1, and spins waiting for *statep to become 2.
+ *    Now, as soon as the parent is scheduled out, it won't schedule
+ *    back in until the child stops spinning.
+ * 3. Child sees *statep become 1, moves to the isolation cpu, sets
+ *    *statep to 2, and spins for half a second, then exits.
+ * 4. Parent sees *statep become 2 and makes one syscall.  The
+ *    syscall should not return until the child's half second of
+ *    spinning is over, which do_quiesce then verifies by checking
+ *    the total elapsed time.
+ */
+
+int *statep, *childstate;
+struct timeval quiesce_start, quiesce_end;
+int child_pid;
+
+static void setup_quiesce(void)
+{
+	struct timeval start, tv;
+	int rc;
+
+	/* First, go back to cpu 0 and allocate some shared memory. */
+	set_my_cpu(0);
+	statep = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		      MAP_ANONYMOUS|MAP_SHARED, -1, 0);
+	assert(statep != MAP_FAILED);
+	childstate = statep + 1;
+
+	gettimeofday(&quiesce_start, NULL);
+
+	/* Fork, then fault in all memory in both parent and child. */
+	child_pid = fork();
+	assert(child_pid >= 0);
+	if (child_pid == 0)
+		*childstate = 1;
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	if (child_pid != 0) {
+		set_my_cpu(task_isolation_cpu);
+		return;
+	}
+
+	/*
+	 * In child.  Wait until parent notifies us that it has completed
+	 * its prctl, then jump to its cpu and let it know.
+	 */
+	*childstate = 2;
+	while (READ_ONCE(*statep) == 0)
+		;
+	*childstate = 3;
+	set_my_cpu(task_isolation_cpu);
+	*statep = 2;
+	*childstate = 4;
+
+	/*
+	 * Now we are competing for the runqueue on task_isolation_cpu.
+	 * Spin for half a second to ensure the parent gets caught in kernel space.
+	 */
+	gettimeofday(&start, NULL);
+	while (1) {
+		double time;
+
+		gettimeofday(&tv, NULL);
+		time = (tv.tv_sec - start.tv_sec) +
+			(tv.tv_usec - start.tv_usec) / 1000000.0;
+		if (time >= 0.5)
+			exit(0);
+	}
+}
+
+static int do_quiesce(void)
+{
+	double time;
+	int rc;
+
+	rc = prctl(PR_SET_TASK_ISOLATION,
+		   PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG);
+	if (rc != 0) {
+		prctl(PR_SET_TASK_ISOLATION, 0);
+		printf("prctl failed: rc %d\n", rc);
+		goto fail;
+	}
+	*statep = 1;
+
+	/* Wait for child to come disturb us. */
+	while (*statep == 1) {
+		gettimeofday(&quiesce_end, NULL);
+		time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
+			(quiesce_end.tv_usec - quiesce_start.tv_usec)/1000000.0;
+		if (time > 0.1 && *statep == 1)	{
+			char buf[100];
+
+			prctl(PR_SET_TASK_ISOLATION, 0);
+			printf("timed out at %gs in child migrate loop (%d)\n",
+			       time, *childstate);
+			sprintf(buf, "cat /proc/%d/stack", child_pid);
+			system(buf);
+			goto fail;
+		}
+	}
+	assert(*statep == 2);
+
+	/*
+	 * At this point the child is spinning, so any interrupt will keep us
+	 * in kernel space.  Make a syscall to make sure it happens at least
+	 * once during the half second that the child is spinning.
+	 */
+	kill(0, 0);
+	gettimeofday(&quiesce_end, NULL);
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
+		(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
+	if (time < 0.4 || time > 0.6) {
+		printf("expected ~0.5s wait after quiesce: was %g\n", time);
+		goto fail;
+	}
+	kill(child_pid, SIGKILL);
+	return KSFT_PASS;
+
+fail:
+	kill(child_pid, SIGKILL);
+	return KSFT_FAIL;
+}
+
+#ifdef __tile__
+#include <arch/spr_def.h>
+#endif
+
+static inline unsigned long get_cycle_count(void)
+{
+#ifdef __x86_64__
+	unsigned int lower, upper;
+
+	asm volatile("rdtsc" : "=a"(lower), "=d"(upper));
+	return lower | ((unsigned long)upper << 32);
+#elif defined(__tile__)
+	return __insn_mfspr(SPR_CYCLE);
+#elif defined(__aarch64__)
+	unsigned long vtick;
+
+	asm volatile("mrs %0, cntvct_el0" : "=r" (vtick));
+	return vtick;
+#else
+#error Unsupported architecture
+#endif
+}
+
+/* Histogram of cycle counts up to HISTSIZE cycles. */
+#define HISTSIZE 500
+long hist[HISTSIZE];
+
+/* Information on loss of control of the cpu (more than HISTSIZE cycles). */
+struct jitter_info {
+	unsigned long at;      /* cycle of jitter event */
+	long cycles;           /* how long we lost the cpu for */
+};
+#define MAX_EVENTS 100
+struct jitter_info jitter[MAX_EVENTS];
+unsigned int count;            /* index into jitter[] */
+
+void jitter_summarize(void)
+{
+	unsigned int i;
+
+	printf("INFO: loop times:\n");
+	for (i = 0; i < HISTSIZE; ++i)
+		if (hist[i])
+			printf("  %d x %ld\n", i, hist[i]);
+
+	if (count)
+		printf("ERROR: jitter:\n");
+	for (i = 0; i < count; ++i)
+		printf("  %ld: %ld cycles\n", jitter[i].at, jitter[i].cycles);
+	if (count == ARRAY_SIZE(jitter))
+		printf("  ... more\n");
+}
+
+void jitter_handler(int sig)
+{
+	printf("\n");
+	if (sig == SIGUSR1) {
+		exit_status = KSFT_FAIL;
+		printf("ERROR: Program unexpectedly entered kernel.\n");
+	}
+	jitter_summarize();
+	exit(exit_status);
+}
+
+void test_jitter(unsigned long waitticks)
+{
+	unsigned long start, last, elapsed;
+	int rc;
+
+	printf("testing task isolation jitter for %ld ticks\n", waitticks);
+
+	signal(SIGINT, jitter_handler);
+	signal(SIGUSR1, jitter_handler);
+	set_my_cpu(task_isolation_cpu);
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+
+	do
+		rc = prctl(PR_SET_TASK_ISOLATION,
+			   PR_TASK_ISOLATION_ENABLE |
+			   PR_TASK_ISOLATION_USERSIG |
+			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+	while (rc != 0 && errno == EAGAIN);
+	if (rc != 0) {
+		printf("couldn't enable isolation (%d): FAIL\n", errno);
+		ksft_exit_fail();
+	}
+
+	last = start = get_cycle_count();
+	do {
+		unsigned long next = get_cycle_count();
+		unsigned long delta = next - last;
+
+		elapsed = next - start;
+		if (__builtin_expect(delta >= HISTSIZE, 0)) {
+			exit_status = KSFT_FAIL;
+			if (count < ARRAY_SIZE(jitter)) {
+				jitter[count].cycles = delta;
+				jitter[count].at = elapsed;
+				WRITE_ONCE(count, count + 1);
+			}
+		} else {
+			hist[delta]++;
+		}
+		last = next;
+
+	} while (elapsed < waitticks);
+
+	prctl(PR_SET_TASK_ISOLATION, 0);
+	jitter_summarize();
+}
+
+int main(int argc, char **argv)
+{
+	/* How many billion ticks to wait after running the other tests? */
+	unsigned long waitticks;
+	char buf[100];
+	char *result, *end;
+	FILE *f;
+
+	if (argc == 1)
+		waitticks = 10;
+	else if (argc == 2)
+		waitticks = strtol(argv[1], NULL, 10);
+	else {
+		printf("syntax: isolation [gigaticks]\n");
+		ksft_exit_fail();
+	}
+	waitticks *= 1000000000;
+
+	/* Test that the /sys device is present and pick a cpu. */
+	f = fopen("/sys/devices/system/cpu/task_isolation", "r");
+	if (f == NULL) {
+		printf("/sys device: SKIP (%s)\n", strerror(errno));
+		ksft_exit_skip();
+	}
+	result = fgets(buf, sizeof(buf), f);
+	assert(result == buf);
+	fclose(f);
+	if (*buf == '\n') {
+		printf("No task_isolation cores configured.\n");
+		ksft_exit_skip();
+	}
+	task_isolation_cpu = strtol(buf, &end, 10);
+	assert(end != buf);
+	assert(*end == ',' || *end == '-' || *end == '\n');
+	assert(task_isolation_cpu >= 0);
+	printf("/sys device : OK (using task isolation cpu %d)\n",
+	       task_isolation_cpu);
+
+	/* Test to see if with no mask set, we fail. */
+	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl unaffinitized: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl unaffinitized: OK\n");
+	}
+
+	/* Or if affinitized to the wrong cpu. */
+	set_my_cpu(0);
+	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl on cpu 0: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl on cpu 0: OK\n");
+	}
+
+	/* Run the tests. */
+	test_killed("test_fault", setup_fault, do_fault);
+	test_killed("test_syscall", NULL, do_syscall);
+	test_munmap();
+	test_unaligned();
+	test_ok("test_off", NULL, do_syscall_off);
+	test_nosig("test_multi", NULL, do_syscall_multi);
+	test_nosig("test_quiesce", setup_quiesce, do_quiesce);
+
+	/* Exit failure if any test failed. */
+	if (exit_status != KSFT_PASS) {
+		printf("Skipping jitter testing due to test failures\n");
+		return exit_status;
+	}
+
+	test_jitter(waitticks);
+
+	return exit_status;
+}
-- 
2.7.2

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 07/13] arm64: factor work_pending state machine to C
  2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
@ 2016-08-17  8:05   ` Will Deacon
  0 siblings, 0 replies; 80+ messages in thread
From: Will Deacon @ 2016-08-17  8:05 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel

Hi Chris,

On Tue, Aug 16, 2016 at 05:19:30PM -0400, Chris Metcalf wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
> 
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
> 
> This patch tries to closely mirror the existing behaviour while using
> the usual C control flow primitives. As local_irq_{disable,enable} may
> be instrumented, we balance exception entry (where we will almost most
> likely enable IRQs) with a call to trace_hardirqs_on just before the
> return to userspace.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  arch/arm64/kernel/entry.S  | 12 ++++--------
>  arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>  2 files changed, 30 insertions(+), 18 deletions(-)

I plan to queue this one in the arm64 tree for 4.9. Should hit -next
sometime next week.

Will

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode)
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (12 preceding siblings ...)
  2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
@ 2016-08-17 19:37 ` Christoph Lameter
  2016-08-20  1:42   ` Chris Metcalf
  2016-09-28 13:16   ` Frederic Weisbecker
  2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
  15 siblings, 2 replies; 80+ messages in thread
From: Christoph Lameter @ 2016-08-17 19:37 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, Francis Giraldeau,
	linux-doc, linux-api, linux-kernel

On Tue, 16 Aug 2016, Chris Metcalf wrote:

> - Dropped Christoph Lameter's patch to avoid scheduling the
>   clocksource watchdog on nohz cores; the recommendation is to just
>   boot with tsc=reliable for NOHZ in any case, if necessary.

We also said that there should be a WARN_ON if tsc=reliable is not
specified and processors are put into NOHZ mode.  Otherwise the
clocksource watchdog is a non-obvious source of scheduling events on
NOHZ processors.


> Frederic, do you have a sense of what is left to be done there?
> I can certainly try to contribute to that effort as well.

Here is a potential fix to the problem that /proc/stat values freeze when
processors go into NOHZ busy mode. I'd like to hear what people think
about the approach here. In particular one issue may be that I am
accessing remote tick-sched structures without serialization. But for
top/ps this may be ok. I noticed that other values shown by top/ps also
sometimes are a bit fuzzy.



Subject: NOHZ: Correctly display increasing cputime when processor is busy

The tick may be switched off when the processor gets busy with nohz full.
The user time fields in /proc/stat will then no longer increase because
the tick is not run to update the cpustat values anymore.

Compensate for the missing ticks by checking if a processor is in
such a mode. If so then add the ticks that have passed since
the tick was switched off to the usertime.

Note that this introduces a slight inaccuracy. The process may
actually do syscalls without triggering a tick again but the
processing time in those calls is negligible. Any wait or sleep
occurrence during syscalls would activate the tick again.

Any inaccuracy is corrected once the tick is switched on again
since the actual value where cputime aggregates is not changed.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/fs/proc/stat.c
===================================================================
--- linux.orig/fs/proc/stat.c	2016-08-04 09:04:57.681480937 -0500
+++ linux/fs/proc/stat.c	2016-08-17 14:27:37.813445675 -0500
@@ -77,6 +77,12 @@ static u64 get_iowait_time(int cpu)

 #endif

+static inline unsigned long get_cputime_user(int cpu)
+{
+	return kcpustat_cpu(cpu).cpustat[CPUTIME_USER] +
+			tick_stopped_busy_ticks(cpu);
+}
+
 static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
@@ -93,7 +99,7 @@ static int show_stat(struct seq_file *p,
 	getboottime64(&boottime);

 	for_each_possible_cpu(i) {
-		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
+		user += get_cputime_user(i);
 		nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
 		system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
 		idle += get_idle_time(i);
@@ -130,7 +136,7 @@ static int show_stat(struct seq_file *p,

 	for_each_online_cpu(i) {
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
+		user = get_cputime_user(i);
 		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
 		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
 		idle = get_idle_time(i);
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c	2016-07-27 08:41:17.109862517 -0500
+++ linux/kernel/time/tick-sched.c	2016-08-17 14:16:42.073835333 -0500
@@ -990,6 +990,24 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }

+/**
+ * tick_stopped_busy_ticks - return the ticks that did not occur while the
+ *				processor was busy and the tick was off
+ *
+ * Called from procfs (/proc/stat) to correctly calculate cputime of nohz
+ * full processors
+ */
+unsigned long tick_stopped_busy_ticks(int cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
+
+	if (!ts->inidle && ts->tick_stopped)
+		return jiffies - ts->idle_jiffies;
+	else
+#endif
+		return 0;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h	2016-08-04 09:04:57.688480730 -0500
+++ linux/include/linux/sched.h	2016-08-17 14:18:30.983613830 -0500
@@ -2516,6 +2516,9 @@ static inline void wake_up_nohz_cpu(int

 #ifdef CONFIG_NO_HZ_FULL
 extern u64 scheduler_tick_max_deferment(void);
+extern unsigned long tick_stopped_busy_ticks(int cpu);
+#else
+static inline unsigned long tick_stopped_busy_ticks(int cpu) { return 0; }
 #endif

 #ifdef CONFIG_SCHED_AUTOGROUP

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode)
  2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
@ 2016-08-20  1:42   ` Chris Metcalf
  2016-09-28 13:16   ` Frederic Weisbecker
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-20  1:42 UTC (permalink / raw)
  To: Christoph Lameter, Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-doc,
	linux-api, linux-kernel

On 8/17/2016 3:37 PM, Christoph Lameter wrote:
> On Tue, 16 Aug 2016, Chris Metcalf wrote:
>
>> - Dropped Christoph Lameter's patch to avoid scheduling the
>>    clocksource watchdog on nohz cores; the recommendation is to just
>>    boot with tsc=reliable for NOHZ in any case, if necessary.
> We also said that there should be a WARN_ON if tsc=reliable is not
> specified and processors are put into NOHZ mode. This is something not
> obvious causing scheduling events on NOHZ processors.

Yes, I agree.  Frederic said he would queue a patch to do that, so I
didn't want to propose another patch that would conflict.

>> Frederic, do you have a sense of what is left to be done there?
>> I can certainly try to contribute to that effort as well.
> Here is a potential fix to the problem that /proc/stat values freeze when
> processors go into NOHZ busy mode. I'd like to hear what people think
> about the approach here. In particular one issue may be that I am
> accessing remote tick-sched structures without serialization. But for
> top/ps this may be ok. I noticed that other values shown by top/os also
> sometime are a bit fuzzy.

This seems pretty plausible to me, but I'm not an expert on what kind
of locking might be required for these data structures.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 08/13] arch/arm64: enable task isolation functionality
  2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
@ 2016-08-26 16:25   ` Catalin Marinas
  0 siblings, 0 replies; 80+ messages in thread
From: Catalin Marinas @ 2016-08-26 16:25 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Will Deacon, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel

On Tue, Aug 16, 2016 at 05:19:31PM -0400, Chris Metcalf wrote:
> In do_notify_resume(), call task_isolation_ready() for
> TIF_TASK_ISOLATION tasks when we are checking the thread-info flags;
> and after we've handled the other work, call task_isolation_enter()
> for such tasks.  To ensure we always call task_isolation_enter() when
> returning to userspace, add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
> while leaving the old bitmask value as _TIF_WORK_LOOP_MASK to
> check while looping.
> 
> We tweak syscall_trace_enter() slightly to carry the "flags"
> value from current_thread_info()->flags for each of the tests,
> rather than doing a volatile read from memory for each one.  This
> avoids a small overhead for each test, and in particular avoids
> that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.
> 
> We instrument the smp_send_reschedule() routine so that it checks for
> isolated tasks and generates a suitable warning if we are about
> to disturb one of them in strict or debug mode.
> 
> Finally, report on page faults in task-isolation processes in
> do_page_faults().
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  arch/arm64/Kconfig                   |  1 +
>  arch/arm64/include/asm/thread_info.h |  5 ++++-
>  arch/arm64/kernel/ptrace.c           | 18 +++++++++++++++---
>  arch/arm64/kernel/signal.c           | 10 ++++++++++
>  arch/arm64/kernel/smp.c              |  2 ++
>  arch/arm64/mm/fault.c                |  8 +++++++-

Not sure when/how this series will be merged (Will already picked patch
07/13 as a general clean-up) but this arm64 patch:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (13 preceding siblings ...)
  2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
@ 2016-08-29 16:27 ` Chris Metcalf
  2016-09-07 21:11   ` Francis Giraldeau
  2016-09-27 14:35   ` Frederic Weisbecker
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
  15 siblings, 2 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-29 16:27 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Francis Giraldeau, linux-doc, linux-api,
	linux-kernel

On 8/16/2016 5:19 PM, Chris Metcalf wrote:
> Here is a respin of the task-isolation patch set.
>
> Again, I have been getting email asking me when and where this patch
> will be upstreamed so folks can start using it.  I had been thinking
> the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
> thing.  But perhaps it touches enough other subsystems that that
> doesn't really make sense?  Andrew, would it make sense to take it
> directly via your tree?  Frederic, Ingo, what do you think?

Ping!

No concerns have been raised yet with the v15 version of the patch series
in the two weeks since I posted it, and I think I have addressed all
previously-raised concerns (or perhaps people have just given up arguing
with me).

I did add Catalin's Reviewed-by to 08/13 (thanks!) and updated my
kernel.org repo.

Does this feel like something we can merge when the 4.9 merge window opens?
If so, whose tree is best suited for it?  Or should I ask Stephen to put it into
linux-next now and then ask Linus to merge it directly?  I recall Ingo thought
this was a bad idea when I suggested it back in January, but I'm not sure where
we got to in terms of a better approach.

Thanks all!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-08-29 16:33   ` Peter Zijlstra
  2016-08-29 16:40     ` Chris Metcalf
  2017-02-02 16:13   ` Eugene Syromiatnikov
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-08-29 16:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> +	/*
> +	 * Request rescheduling unless we are in full dynticks mode.
> +	 * We would eventually get pre-empted without this, and if
> +	 * there's another task waiting, it would run; but by
> +	 * explicitly requesting the reschedule, we may reduce the
> +	 * latency.  We could directly call schedule() here as well,
> +	 * but since our caller is the standard place where schedule()
> +	 * is called, we defer to the caller.
> +	 *
> +	 * A more substantive approach here would be to use a struct
> +	 * completion here explicitly, and complete it when we shut
> +	 * down dynticks, but since we presumably have nothing better
> +	 * to do on this core anyway, just spinning seems plausible.
> +	 */
> +	if (!tick_nohz_tick_stopped())
> +		set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:33   ` Peter Zijlstra
@ 2016-08-29 16:40     ` Chris Metcalf
  2016-08-29 16:48       ` Peter Zijlstra
  2016-08-30  7:58       ` Peter Zijlstra
  0 siblings, 2 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-29 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>> +	/*
>> +	 * Request rescheduling unless we are in full dynticks mode.
>> +	 * We would eventually get pre-empted without this, and if
>> +	 * there's another task waiting, it would run; but by
>> +	 * explicitly requesting the reschedule, we may reduce the
>> +	 * latency.  We could directly call schedule() here as well,
>> +	 * but since our caller is the standard place where schedule()
>> +	 * is called, we defer to the caller.
>> +	 *
>> +	 * A more substantive approach here would be to use a struct
>> +	 * completion here explicitly, and complete it when we shut
>> +	 * down dynticks, but since we presumably have nothing better
>> +	 * to do on this core anyway, just spinning seems plausible.
>> +	 */
>> +	if (!tick_nohz_tick_stopped())
>> +		set_tsk_need_resched(current);
> This is broken.. and it would be really good if you don't actually need
> to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

Even without that consideration, we really can't be sure we stay in
dynticks mode if we disable the dynamic tick, but then enable interrupts,
and end up taking an interrupt on the way back to userspace, and
it turns the tick back on.  That's why we do it here, where we know
interrupts will stay disabled until we get to userspace.

So if we are doing it here, what else can/should we do?  There really
shouldn't be any other tasks waiting to run at this point, so there's
not a heck of a lot else to do on this core.  We could just spin and
check need_resched and signal status manually instead, but that
seems kind of duplicative of code already done in our caller here.

So... thoughts?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:40     ` Chris Metcalf
@ 2016-08-29 16:48       ` Peter Zijlstra
  2016-08-29 16:53         ` Chris Metcalf
  2016-08-30  7:58       ` Peter Zijlstra
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-08-29 16:48 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> >On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> >>+	/*
> >>+	 * Request rescheduling unless we are in full dynticks mode.
> >>+	 * We would eventually get pre-empted without this, and if
> >>+	 * there's another task waiting, it would run; but by
> >>+	 * explicitly requesting the reschedule, we may reduce the
> >>+	 * latency.  We could directly call schedule() here as well,
> >>+	 * but since our caller is the standard place where schedule()
> >>+	 * is called, we defer to the caller.
> >>+	 *
> >>+	 * A more substantive approach here would be to use a struct
> >>+	 * completion here explicitly, and complete it when we shut
> >>+	 * down dynticks, but since we presumably have nothing better
> >>+	 * to do on this core anyway, just spinning seems plausible.
> >>+	 */
> >>+	if (!tick_nohz_tick_stopped())
> >>+		set_tsk_need_resched(current);
> >This is broken.. and it would be really good if you don't actually need
> >to do this.
> 
> Can you elaborate?  

Naked use of TIF_NEED_RESCHED like this is busted. There is more state
that needs to be poked to keep things consistent / working.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:48       ` Peter Zijlstra
@ 2016-08-29 16:53         ` Chris Metcalf
  2016-08-30  7:59           ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-29 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/29/2016 12:48 PM, Peter Zijlstra wrote:
> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>> +	/*
>>>> +	 * Request rescheduling unless we are in full dynticks mode.
>>>> +	 * We would eventually get pre-empted without this, and if
>>>> +	 * there's another task waiting, it would run; but by
>>>> +	 * explicitly requesting the reschedule, we may reduce the
>>>> +	 * latency.  We could directly call schedule() here as well,
>>>> +	 * but since our caller is the standard place where schedule()
>>>> +	 * is called, we defer to the caller.
>>>> +	 *
>>>> +	 * A more substantive approach here would be to use a struct
>>>> +	 * completion here explicitly, and complete it when we shut
>>>> +	 * down dynticks, but since we presumably have nothing better
>>>> +	 * to do on this core anyway, just spinning seems plausible.
>>>> +	 */
>>>> +	if (!tick_nohz_tick_stopped())
>>>> +		set_tsk_need_resched(current);
>>> This is broken.. and it would be really good if you don't actually need
>>> to do this.
>> Can you elaborate?
> Naked use of TIF_NEED_RESCHED like this is busted. There is more state
> that needs to be poked to keep things consistent / working.

Would it be cleaner to just replace the set_tsk_need_resched() call
with something like:

     set_current_state(TASK_INTERRUPTIBLE);
     schedule();
     __set_current_state(TASK_RUNNING);

or what would you recommend?

Or, as I said, just doing a busy loop here while testing to see
if need_resched or signal had been set?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:40     ` Chris Metcalf
  2016-08-29 16:48       ` Peter Zijlstra
@ 2016-08-30  7:58       ` Peter Zijlstra
  2016-08-30 15:32         ` Chris Metcalf
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-08-30  7:58 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> >On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> >>+	/*
> >>+	 * Request rescheduling unless we are in full dynticks mode.
> >>+	 * We would eventually get pre-empted without this, and if
> >>+	 * there's another task waiting, it would run; but by
> >>+	 * explicitly requesting the reschedule, we may reduce the
> >>+	 * latency.  We could directly call schedule() here as well,
> >>+	 * but since our caller is the standard place where schedule()
> >>+	 * is called, we defer to the caller.
> >>+	 *
> >>+	 * A more substantive approach here would be to use a struct
> >>+	 * completion here explicitly, and complete it when we shut
> >>+	 * down dynticks, but since we presumably have nothing better
> >>+	 * to do on this core anyway, just spinning seems plausible.
> >>+	 */
> >>+	if (!tick_nohz_tick_stopped())
> >>+		set_tsk_need_resched(current);
> >This is broken.. and it would be really good if you don't actually need
> >to do this.
> 
> Can you elaborate?  We clearly do want to wait until we are in full
> dynticks mode before we return to userspace.
> 
> We could do it just in the prctl() syscall only, but then we lose the
> ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?

> Even without that consideration, we really can't be sure we stay in
> dynticks mode if we disable the dynamic tick, but then enable interrupts,
> and end up taking an interrupt on the way back to userspace, and
> it turns the tick back on.  That's why we do it here, where we know
> interrupts will stay disabled until we get to userspace.

But but but.. task_isolation_enter() is explicitly ran with IRQs
_enabled_!! It even WARNs if they're disabled.

> So if we are doing it here, what else can/should we do?  There really
> shouldn't be any other tasks waiting to run at this point, so there's
> not a heck of a lot else to do on this core.  We could just spin and
> check need_resched and signal status manually instead, but that
> seems kind of duplicative of code already done in our caller here.

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:53         ` Chris Metcalf
@ 2016-08-30  7:59           ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2016-08-30  7:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:53:30PM -0400, Chris Metcalf wrote:

> Would it be cleaner to just replace the set_tsk_need_resched() call
> with something like:
> 
>     set_current_state(TASK_INTERRUPTIBLE);
>     schedule();
>     __set_current_state(TASK_RUNNING);
> 
> or what would you recommend?

That'll just get you to sleep _forever_...

> Or, as I said, just doing a busy loop here while testing to see
> if need_resched or signal had been set?

Why do you care about need_resched() and or signals? How is that related
to the tick being stopped or not?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30  7:58       ` Peter Zijlstra
@ 2016-08-30 15:32         ` Chris Metcalf
  2016-08-30 16:30           ` Andy Lutomirski
  2016-09-01 10:06           ` Peter Zijlstra
  0 siblings, 2 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-08-30 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>> +	/*
>>>> +	 * Request rescheduling unless we are in full dynticks mode.
>>>> +	 * We would eventually get pre-empted without this, and if
>>>> +	 * there's another task waiting, it would run; but by
>>>> +	 * explicitly requesting the reschedule, we may reduce the
>>>> +	 * latency.  We could directly call schedule() here as well,
>>>> +	 * but since our caller is the standard place where schedule()
>>>> +	 * is called, we defer to the caller.
>>>> +	 *
>>>> +	 * A more substantive approach here would be to use a struct
>>>> +	 * completion here explicitly, and complete it when we shut
>>>> +	 * down dynticks, but since we presumably have nothing better
>>>> +	 * to do on this core anyway, just spinning seems plausible.
>>>> +	 */
>>>> +	if (!tick_nohz_tick_stopped())
>>>> +		set_tsk_need_resched(current);
>>> This is broken.. and it would be really good if you don't actually need
>>> to do this.
>> Can you elaborate?  We clearly do want to wait until we are in full
>> dynticks mode before we return to userspace.
>>
>> We could do it just in the prctl() syscall only, but then we lose the
>> ability to implement the NOSIG mode, which can be a convenience.
> So this isn't spelled out anywhere. Why does this need to be in the
> return to user path?

I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

>> Even without that consideration, we really can't be sure we stay in
>> dynticks mode if we disable the dynamic tick, but then enable interrupts,
>> and end up taking an interrupt on the way back to userspace, and
>> it turns the tick back on.  That's why we do it here, where we know
>> interrupts will stay disabled until we get to userspace.
> But but but.. task_isolation_enter() is explicitly ran with IRQs
> _enabled_!! It even WARNs if they're disabled.

Yes, true!  But if you pop up to the caller, the key thing is the
task_isolation_ready() routine where we are invoked with interrupts
disabled, and we confirm that all our criteria are met (including
tick_nohz_tick_stopped), and then leave interrupts disabled as we
return from there onwards to userspace.

The task_isolation_enter() code just makes a best-effort attempt to
make sure all these criteria are met, just like all the other TIF_xxx
flag tests do in exit_to_usermode_loop() on x86, like scheduling,
delivering signals, etc.  As you know, we might run that code, go
around the loop, and discover that the TIF flag has been re-set, and
we have to run the code again before all of that stuff has "quiesced".
The isolation code uses that same model; the only difference is that
we clear the TIF flag manually in the loop by checking
task_isolation_ready().

>> So if we are doing it here, what else can/should we do?  There really
>> shouldn't be any other tasks waiting to run at this point, so there's
>> not a heck of a lot else to do on this core.  We could just spin and
>> check need_resched and signal status manually instead, but that
>> seems kind of duplicative of code already done in our caller here.
> What !? I really don't get this, what are you waiting for? Why is
> rescheduling making things better.

We need to wait for the last dyntick to fire before we can return to
userspace.  There are plenty of options as to what we can do in the
meanwhile.

1. Try to schedule().  Good luck with that in practice, since a
userspace process that has enabled task isolation is going to be alone
on its core unless something pretty broken is happening on the system.
But, at least folks understand the idiom of scheduling out while you wait.

2. Another variant of that: set up a wait completion and have the
dynticks code complete it when the tick turns off.  But this adds
complexity to option 1, and really doesn't buy us much in practice
that I can see.

3. Just admit that we are likely alone on the core, and just burn
cycles in a busy loop waiting for that last tick to fire.  Obviously
if we do this we also need to test for signals and resched so the core
remains responsive.  We can either do this in a loop just by spinning
explicitly, or I could literally just remove the line in the current
patchset that sets TIF_NEED_RESCHED, at which point we busy-wait by
just going around and around in exit_to_usermode_loop().  The only
flaw here is that we don't mark the task explicitly as TASK_INTERRUPTIBLE
while we are doing this - and that's probably worth doing.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 15:32         ` Chris Metcalf
@ 2016-08-30 16:30           ` Andy Lutomirski
  2016-08-30 17:02             ` Chris Metcalf
  2016-09-01 10:06           ` Peter Zijlstra
  1 sibling, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-08-30 16:30 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Michal Hocko,
	linux-mm, linux-doc, Linux API, linux-kernel

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>
>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>
>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>
>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>
>>>>> +       /*
>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>> +        * We would eventually get pre-empted without this, and if
>>>>> +        * there's another task waiting, it would run; but by
>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>> +        * but since our caller is the standard place where schedule()
>>>>> +        * is called, we defer to the caller.
>>>>> +        *
>>>>> +        * A more substantive approach here would be to use a struct
>>>>> +        * completion here explicitly, and complete it when we shut
>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>> +        */
>>>>> +       if (!tick_nohz_tick_stopped())
>>>>> +               set_tsk_need_resched(current);
>>>>
>>>> This is broken.. and it would be really good if you don't actually need
>>>> to do this.
>>>
>>> Can you elaborate?  We clearly do want to wait until we are in full
>>> dynticks mode before we return to userspace.
>>>
>>> We could do it just in the prctl() syscall only, but then we lose the
>>> ability to implement the NOSIG mode, which can be a convenience.
>>
>> So this isn't spelled out anywhere. Why does this need to be in the
>> return to user path?
>
>
> I'm not sure where this should be spelled out, to be honest.  I guess
> I can add some commentary to the commit message explaining this part.
>
> The basic idea is just that we don't want to be at risk from the
> dyntick getting enabled.  Similarly, we don't want to be at risk of a
> later global IPI due to lru_add_drain stuff, for example.  And, we may
> want to add additional stuff, like catching kernel TLB flushes and
> deferring them when a remote core is in userspace.  To do all of this
> kind of stuff, we need to run in the return to user path so we are
> late enough to guarantee no further kernel things will happen to
> perturb our carefully-arranged isolation state that includes dyntick
> off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 16:30           ` Andy Lutomirski
@ 2016-08-30 17:02             ` Chris Metcalf
  2016-08-30 18:43               ` Andy Lutomirski
  0 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-30 17:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Michal Hocko,
	linux-mm, linux-doc, Linux API, linux-kernel

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>> +       /*
>>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>>> +        * We would eventually get pre-empted without this, and if
>>>>>> +        * there's another task waiting, it would run; but by
>>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>>> +        * but since our caller is the standard place where schedule()
>>>>>> +        * is called, we defer to the caller.
>>>>>> +        *
>>>>>> +        * A more substantive approach here would be to use a struct
>>>>>> +        * completion here explicitly, and complete it when we shut
>>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>>> +        */
>>>>>> +       if (!tick_nohz_tick_stopped())
>>>>>> +               set_tsk_need_resched(current);
>>>>> This is broken.. and it would be really good if you don't actually need
>>>>> to do this.
>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>> dynticks mode before we return to userspace.
>>>>
>>>> We could do it just in the prctl() syscall only, but then we lose the
>>>> ability to implement the NOSIG mode, which can be a convenience.
>>> So this isn't spelled out anywhere. Why does this need to be in the
>>> return to user path?
>>
>> I'm not sure where this should be spelled out, to be honest.  I guess
>> I can add some commentary to the commit message explaining this part.
>>
>> The basic idea is just that we don't want to be at risk from the
>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>> want to add additional stuff, like catching kernel TLB flushes and
>> deferring them when a remote core is in userspace.  To do all of this
>> kind of stuff, we need to run in the return to user path so we are
>> late enough to guarantee no further kernel things will happen to
>> perturb our carefully-arranged isolation state that includes dyntick
>> off, per-cpu lru cache empty, etc etc.
> None of the above should need to *loop*, though, AFAIK.

Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 17:02             ` Chris Metcalf
@ 2016-08-30 18:43               ` Andy Lutomirski
  2016-08-30 19:37                 ` Chris Metcalf
  2016-09-30 16:59                 ` Chris Metcalf
  0 siblings, 2 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-08-30 18:43 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>>>
>>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>>>
>>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>>>
>>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>>>
>>>>>>> +       /*
>>>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>>>> +        * We would eventually get pre-empted without this, and if
>>>>>>> +        * there's another task waiting, it would run; but by
>>>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>>>> +        * but since our caller is the standard place where schedule()
>>>>>>> +        * is called, we defer to the caller.
>>>>>>> +        *
>>>>>>> +        * A more substantive approach here would be to use a struct
>>>>>>> +        * completion here explicitly, and complete it when we shut
>>>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>>>> +        */
>>>>>>> +       if (!tick_nohz_tick_stopped())
>>>>>>> +               set_tsk_need_resched(current);
>>>>>>
>>>>>> This is broken.. and it would be really good if you don't actually need
>>>>>> to do this.
>>>>>
>>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>>> dynticks mode before we return to userspace.
>>>>>
>>>>> We could do it just in the prctl() syscall only, but then we lose the
>>>>> ability to implement the NOSIG mode, which can be a convenience.
>>>>
>>>> So this isn't spelled out anywhere. Why does this need to be in the
>>>> return to user path?
>>>
>>>
>>> I'm not sure where this should be spelled out, to be honest.  I guess
>>> I can add some commentary to the commit message explaining this part.
>>>
>>> The basic idea is just that we don't want to be at risk from the
>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>> want to add additional stuff, like catching kernel TLB flushes and
>>> deferring them when a remote core is in userspace.  To do all of this
>>> kind of stuff, we need to run in the return to user path so we are
>>> late enough to guarantee no further kernel things will happen to
>>> perturb our carefully-arranged isolation state that includes dyntick
>>> off, per-cpu lru cache empty, etc etc.
>>
>> None of the above should need to *loop*, though, AFAIK.
>
>
> Ordering is a problem, though.
>
> We really want to run task isolation last, so we can guarantee that
> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
> cache empty, etc).  But achieving that state can require enabling
> interrupts - most obviously if we have to schedule, e.g. for vmstat
> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
> just while waiting for that last dyntick interrupt to occur.  I'm also
> not sure that even something as simple as draining the per-cpu lru
> cache can be done holding interrupts disabled throughout - certainly
> there's a !SMP code path there that just re-enables interrupts
> unconditionally, which gives me pause.
>
> At any rate at that point you need to retest for signals, resched,
> etc, all as usual, and then you need to recheck the task isolation
> prerequisites once more.
>
> I may be missing something here, but it's really not obvious to me
> that there's a way to do this without having task isolation integrated
> into the usual return-to-userspace loop.
>

What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.

Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 18:43               ` Andy Lutomirski
@ 2016-08-30 19:37                 ` Chris Metcalf
  2016-08-30 19:50                   ` Andy Lutomirski
  2016-09-30 16:59                 ` Chris Metcalf
  1 sibling, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-08-30 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>> The basic idea is just that we don't want to be at risk from the
>>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>>> want to add additional stuff, like catching kernel TLB flushes and
>>>> deferring them when a remote core is in userspace.  To do all of this
>>>> kind of stuff, we need to run in the return to user path so we are
>>>> late enough to guarantee no further kernel things will happen to
>>>> perturb our carefully-arranged isolation state that includes dyntick
>>>> off, per-cpu lru cache empty, etc etc.
>>> None of the above should need to *loop*, though, AFAIK.
>> Ordering is a problem, though.
>>
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc).  But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur.  I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
>>
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace.  Then, on the next entry, clear
> that flag.  With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.

This is an interesting idea!

However, there are a number of implementation issues that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Then, there are the two actual subsystems in question.  It looks like
we could intercept LRU reasonably cleanly by hooking pagevec_add()
to return zero when we are in this "going quiescent" mode, and that
would keep the per-cpu vectors empty.  The vmstat stuff is a little
trickier since all the existing code is built around updating the per-cpu
stuff and then only later copying it off to the global state.  I suppose
we could add a test-and-flush at the end of every public API and not
worry about the implementation cost.

But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this.  My guess
is that it is easier to support the kind of "are you clean?" / "get clean"
APIs for subsystems, rather than weaving a whole set of "stay clean"
mechanism into each subsystem.

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

> Also, this cond_resched stuff doesn't worry me too much at a
> fundamental level -- if we're really going quiescent, shouldn't we be
> able to arrange that there are no other schedulable tasks on the CPU
> in question?

We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.)  But I'm reluctant to guarantee
every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if that
sort of thing happens.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 19:37                 ` Chris Metcalf
@ 2016-08-30 19:50                   ` Andy Lutomirski
  2016-09-02 14:04                     ` Chris Metcalf
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-08-30 19:50 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>
>> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>>>
>>>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com>
>>>> wrote:
>>>>>
>>>>> The basic idea is just that we don't want to be at risk from the
>>>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>>>> want to add additional stuff, like catching kernel TLB flushes and
>>>>> deferring them when a remote core is in userspace.  To do all of this
>>>>> kind of stuff, we need to run in the return to user path so we are
>>>>> late enough to guarantee no further kernel things will happen to
>>>>> perturb our carefully-arranged isolation state that includes dyntick
>>>>> off, per-cpu lru cache empty, etc etc.
>>>>
>>>> None of the above should need to *loop*, though, AFAIK.
>>>
>>> Ordering is a problem, though.
>>>
>>> We really want to run task isolation last, so we can guarantee that
>>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>>> cache empty, etc).  But achieving that state can require enabling
>>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>>> just while waiting for that last dyntick interrupt to occur.  I'm also
>>> not sure that even something as simple as draining the per-cpu lru
>>> cache can be done holding interrupts disabled throughout - certainly
>>> there's a !SMP code path there that just re-enables interrupts
>>> unconditionally, which gives me pause.
>>>
>>> At any rate at that point you need to retest for signals, resched,
>>> etc, all as usual, and then you need to recheck the task isolation
>>> prerequisites once more.
>>>
>>> I may be missing something here, but it's really not obvious to me
>>> that there's a way to do this without having task isolation integrated
>>> into the usual return-to-userspace loop.
>>>
>> What if we did it the other way around: set a percpu flag saying
>> "going quiescent; disallow new deferred work", then finish all
>> existing work and return to userspace.  Then, on the next entry, clear
>> that flag.  With the flag set, vmstat would just flush anything that
>> it accumulates immediately, nothing would be added to the LRU list,
>> etc.
>
>
> This is an interesting idea!
>
> However, there are a number of implementation ideas that make me
> worry that it might be a trickier approach overall.
>
> First, "on the next entry" hides a world of hurt in four simple words.
> Some platforms (arm64 and tile, that I'm familiar with) have a common
> chunk of code that always runs on every entry to the kernel.  It would
> not be too hard to poke at the assembly and make those platforms
> always run some task-isolation specific code on entry.  But x86 scares
> me - there seem to be a whole lot of ways to get into the kernel, and
> I'm not convinced there is a lot of shared macrology or whatever that
> would make it straightforward to intercept all of them.

Just use the context tracking entry hook.  It's 100% reliable.  The
relevant x86 function is enter_from_user_mode(), but I would just hook
into user_exit() in the common code.  (This code had better be
reliable, because context tracking depends on it, and, if context
tracking doesn't work on a given arch, then isolation isn't going to
work regardless.)

>
> Then, there are the two actual subsystems in question.  It looks like
> we could intercept LRU reasonably cleanly by hooking pagevec_add()
> to return zero when we are in this "going quiescent" mode, and that
> would keep the per-cpu vectors empty.  The vmstat stuff is a little
> trickier since all the existing code is built around updating the per-cpu
> stuff and then only later copying it off to the global state.  I suppose
> we could add a test-and-flush at the end of every public API and not
> worry about the implementation cost.

Seems reasonable to me.  If anyone cares about the performance hit,
they can fix it.

>
> But it does seem like we are adding noticeable maintenance cost on
> the mainline kernel to support task isolation by doing this.  My guess
> is that it is easier to support the kind of "are you clean?" / "get clean"
> APIs for subsystems, rather than weaving a whole set of "stay clean"
> mechanism into each subsystem.

My intuition is that it's the other way around.  For the mainline
kernel, having a nice clean well-integrated implementation is nicer
than having a bolted-on implementation that interacts in potentially
complicated ways.  Once quiescence support is in mainline, the size of
the diff or the degree to which it's scattered around is irrelevant
because it's not a diff any more.

>
> So to pop up a level, what is your actual concern about the existing
> "do it in a loop" model?  The macrology currently in use means there
> is zero cost if you don't configure TASK_ISOLATION, and the software
> maintenance cost seems low since the idioms used for task isolation
> in the loop are generally familiar to people reading that code.

My concern is that it's not obvious to readers of the code that the
loop ever terminates.  It really ought to, but it's doing something
very odd.  Normally we can loop because we get scheduled out, but
actually blocking in the return-to-userspace path, especially blocking
on a condition that doesn't have a wakeup associated with it, is odd.

>
>> Also, this cond_resched stuff doesn't worry me too much at a
>> fundamental level -- if we're really going quiescent, shouldn't we be
>> able to arrange that there are no other schedulable tasks on the CPU
>> in question?
>
>
> We aren't currently planning to enforce things in the scheduler, so if
> the application affinitizes another task on top of an existing task
> isolation task, by default the task isolation task just dies. (Unless
> it's using NOSIG mode, in which case it just ends up stuck in the
> kernel trying to wait out the dyntick until you either kill it, or
> re-affinitize the offending task.)  But I'm reluctant to guarantee
> every possible way that you might (perhaps briefly) have some
> schedulable task, and the current approach seems pretty robust if that
> sort of thing happens.
>

This kind of waiting out the dyntick scares me.  Why is there ever a
dyntick that you're waiting out?  If quiescence is to be a supported
mainline feature, shouldn't the scheduler be integrated well enough
with it that you don't need to wait like this?


* Re: [PATCH v15 06/13] arch/x86: enable task isolation functionality
  2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-08-30 21:46   ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-08-30 21:46 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Christoph Lameter, Gilad Ben Yossef,
	Andrew Morton, Viresh Kumar, Ingo Molnar, Steven Rostedt,
	Tejun Heo, Will Deacon, Rik van Riel, Frederic Weisbecker,
	Paul E. McKenney, linux-kernel, X86 ML, H. Peter Anvin,
	Catalin Marinas, Peter Zijlstra

On Aug 16, 2016 11:20 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> In exit_to_usermode_loop(), call task_isolation_ready() for
> TIF_TASK_ISOLATION tasks when we are checking the thread-info flags,
> and after we've handled the other work, call task_isolation_enter()
> for such tasks.
>
> In syscall_trace_enter_phase1(), we add the necessary support for
> reporting syscalls for task-isolation processes.
>
> We add strict reporting for the kernel exception types that do
> not result in signals, namely non-signalling page faults and
> non-signalling MPX fixups.
>
> Tested-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  arch/x86/Kconfig                   |  1 +
>  arch/x86/entry/common.c            | 21 ++++++++++++++++++++-
>  arch/x86/include/asm/thread_info.h |  4 +++-
>  arch/x86/kernel/smp.c              |  2 ++
>  arch/x86/kernel/traps.c            |  3 +++
>  arch/x86/mm/fault.c                |  5 +++++
>  6 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c580d8c33562..7f6ec46d18d0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -90,6 +90,7 @@ config X86
>         select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if MMU && COMPAT
>         select HAVE_ARCH_SECCOMP_FILTER
>         select HAVE_ARCH_SOFT_DIRTY             if X86_64
> +       select HAVE_ARCH_TASK_ISOLATION
>         select HAVE_ARCH_TRACEHOOK
>         select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>         select HAVE_ARCH_WITHIN_STACK_FRAMES
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..3b23b3542909 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -21,6 +21,7 @@
>  #include <linux/context_tracking.h>
>  #include <linux/user-return-notifier.h>
>  #include <linux/uprobes.h>
> +#include <linux/isolation.h>
>
>  #include <asm/desc.h>
>  #include <asm/traps.h>
> @@ -91,6 +92,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
>         if (emulated)
>                 return -1L;
>
> +       /*
> +        * In task isolation mode, we may prevent the syscall from
> +        * running, and if so we also deliver a signal to the process.
> +        */
> +       if (work & _TIF_TASK_ISOLATION) {
> +               if (task_isolation_syscall(regs->orig_ax) == -1)
> +                       return -1L;
> +               work &= ~_TIF_TASK_ISOLATION;
> +       }
> +

[apparently i failed to hit send earlier...]

Have you confirmed that this works correctly wrt PTRACE_SYSCALL?  It
should result in an even number of events (like raise(2) or an async
signal) and that should have a test case.

--Andy


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 15:32         ` Chris Metcalf
  2016-08-30 16:30           ` Andy Lutomirski
@ 2016-09-01 10:06           ` Peter Zijlstra
  2016-09-02 14:03             ` Chris Metcalf
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-09-01 10:06 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Aug 30, 2016 at 11:32:16AM -0400, Chris Metcalf wrote:
> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

> >What !? I really don't get this, what are you waiting for? Why is
> >rescheduling making things better.
> 
> We need to wait for the last dyntick to fire before we can return to
> userspace.  There are plenty of options as to what we can do in the
> meanwhile.

Why not keep your _TIF_TASK_ISOLATION_FOO flag set and re-enter the
loop?

I really don't see how setting TIF_NEED_RESCHED is helping anything.


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-01 10:06           ` Peter Zijlstra
@ 2016-09-02 14:03             ` Chris Metcalf
  2016-09-02 16:40               ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-09-02 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 9/1/2016 6:06 AM, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 11:32:16AM -0400, Chris Metcalf wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> What !? I really don't get this, what are you waiting for? Why is
>>> rescheduling making things better.
>> We need to wait for the last dyntick to fire before we can return to
>> userspace.  There are plenty of options as to what we can do in the
>> meanwhile.
> Why not keep your _TIF_TASK_ISOLATION_FOO flag set and re-enter the
> loop?
>
> I really don't see how setting TIF_NEED_RESCHED is helping anything.

Yes, I think I addressed that in an earlier reply to Frederic; and you're right,
I don't think TIF_NEED_RESCHED or schedule() are the way to go.

https://lkml.kernel.org/g/107bd666-dbcf-7fa5-ff9c-f79358899712@mellanox.com

Any thoughts on the question of "just re-enter the loop" vs. schedule_timeout()?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 19:50                   ` Andy Lutomirski
@ 2016-09-02 14:04                     ` Chris Metcalf
  2016-09-02 17:28                       ` Andy Lutomirski
  0 siblings, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-09-02 14:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>> What if we did it the other way around: set a percpu flag saying
>>> "going quiescent; disallow new deferred work", then finish all
>>> existing work and return to userspace.  Then, on the next entry, clear
>>> that flag.  With the flag set, vmstat would just flush anything that
>>> it accumulates immediately, nothing would be added to the LRU list,
>>> etc.
>>
>> This is an interesting idea!
>>
>> However, there are a number of implementation issues that make me
>> worry that it might be a trickier approach overall.
>>
>> First, "on the next entry" hides a world of hurt in four simple words.
>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>> chunk of code that always runs on every entry to the kernel.  It would
>> not be too hard to poke at the assembly and make those platforms
>> always run some task-isolation specific code on entry.  But x86 scares
>> me - there seem to be a whole lot of ways to get into the kernel, and
>> I'm not convinced there is a lot of shared macrology or whatever that
>> would make it straightforward to intercept all of them.
> Just use the context tracking entry hook.  It's 100% reliable.  The
> relevant x86 function is enter_from_user_mode(), but I would just hook
> into user_exit() in the common code.  (This code had better be
> reliable, because context tracking depends on it, and, if context
> tracking doesn't work on a given arch, then isolation isn't going to
> work regardless.)

This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
we could do an entry-point approach plausibly now.

This is also good for when we want to look at deferring the kernel TLB flush,
since it's the same mechanism that would be required for that.

>> But it does seem like we are adding noticeable maintenance cost on
>> the mainline kernel to support task isolation by doing this.  My guess
>> is that it is easier to support the kind of "are you clean?" / "get clean"
>> APIs for subsystems, rather than weaving a whole set of "stay clean"
>> mechanism into each subsystem.
> My intuition is that it's the other way around.  For the mainline
> kernel, having a nice clean well-integrated implementation is nicer
> than having a bolted-on implementation that interacts in potentially
> complicated ways.  Once quiescence support is in mainline, the size of
> the diff or the degree to which it's scattered around is irrelevant
> because it's not a diff any more.

I'm not concerned with the size of the diff, just with the intrusiveness into
the various subsystems.

That said, code talks, so let me take a swing at doing it the way you suggest
for vmstat/lru and we'll see what it looks like.

>> So to pop up a level, what is your actual concern about the existing
>> "do it in a loop" model?  The macrology currently in use means there
>> is zero cost if you don't configure TASK_ISOLATION, and the software
>> maintenance cost seems low since the idioms used for task isolation
>> in the loop are generally familiar to people reading that code.
> My concern is that it's not obvious to readers of the code that the
> loop ever terminates.  It really ought to, but it's doing something
> very odd.  Normally we can loop because we get scheduled out, but
> actually blocking in the return-to-userspace path, especially blocking
> on a condition that doesn't have a wakeup associated with it, is odd.

True, although, comments :-)

Regardless, though, this doesn't seem at all weird to me in the
context of the vmstat and lru stuff.  It's exactly parallel to
the fact that we loop around on checking need_resched and signal, and
in some cases you could imagine multiple loops around when we schedule
out and get a signal, so loop around again, and then another
reschedule event happens during signal processing so we go around
again, etc.  Eventually it settles down.  It's the same with the
vmstat/lru stuff.

>>> Also, this cond_resched stuff doesn't worry me too much at a
>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>> able to arrange that there are no other schedulable tasks on the CPU
>>> in question?
>> We aren't currently planning to enforce things in the scheduler, so if
>> the application affinitizes another task on top of an existing task
>> isolation task, by default the task isolation task just dies. (Unless
>> it's using NOSIG mode, in which case it just ends up stuck in the
>> kernel trying to wait out the dyntick until you either kill it, or
>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>> every possible way that you might (perhaps briefly) have some
>> schedulable task, and the current approach seems pretty robust if that
>> sort of thing happens.
> This kind of waiting out the dyntick scares me.  Why is there ever a
> dyntick that you're waiting out?  If quiescence is to be a supported
> mainline feature, shouldn't the scheduler be integrated well enough
> with it that you don't need to wait like this?

Well, this is certainly the funkiest piece of the task isolation
stuff.  The problem is that the dyntick stuff may, for example, need
one more tick 4us from now (or whatever) just to close out the current
RCU period.  We can't return to userspace until that happens.  So what
else can we do when the task is ready to return to userspace?  We
could punt into the idle task instead of waiting in this task, which
was my earlier schedule_timeout() suggestion.  Do you think that's cleaner?

> Have you confirmed that this works correctly wrt PTRACE_SYSCALL?  It
> should result in an even number of events (like raise(2) or an async
> signal) and that should have a test case.

I have not tested PTRACE_SYSCALL.  I'll see about adding that to the
selftest code.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 14:03             ` Chris Metcalf
@ 2016-09-02 16:40               ` Peter Zijlstra
  0 siblings, 0 replies; 80+ messages in thread
From: Peter Zijlstra @ 2016-09-02 16:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Fri, Sep 02, 2016 at 10:03:52AM -0400, Chris Metcalf wrote:
> Any thoughts on the question of "just re-enter the loop" vs. schedule_timeout()?

schedule_timeout() should only be used for things you do not have
control over, like things outside of the machine.

If you want to actually block running, use that completion you were
talking of.


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 14:04                     ` Chris Metcalf
@ 2016-09-02 17:28                       ` Andy Lutomirski
  2016-09-09 17:40                         ` Chris Metcalf
  2016-09-27 14:22                         ` Frederic Weisbecker
  0 siblings, 2 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-09-02 17:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, linux-doc, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Viresh Kumar, Linux API,
	Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>
>>>> What if we did it the other way around: set a percpu flag saying
>>>> "going quiescent; disallow new deferred work", then finish all
>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>> etc.
>>>
>>>
>>> This is an interesting idea!
>>>
>>> However, there are a number of implementation issues that make me
>>> worry that it might be a trickier approach overall.
>>>
>>> First, "on the next entry" hides a world of hurt in four simple words.
>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>> chunk of code that always runs on every entry to the kernel.  It would
>>> not be too hard to poke at the assembly and make those platforms
>>> always run some task-isolation specific code on entry.  But x86 scares
>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>> I'm not convinced there is a lot of shared macrology or whatever that
>>> would make it straightforward to intercept all of them.
>>
>> Just use the context tracking entry hook.  It's 100% reliable.  The
>> relevant x86 function is enter_from_user_mode(), but I would just hook
>> into user_exit() in the common code.  (This code had better be
>> reliable, because context tracking depends on it, and, if context
>> tracking doesn't work on a given arch, then isolation isn't going to
>> work regardless.)
>
>
> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
> we could do an entry-point approach plausibly now.
>
> This is also good for when we want to look at deferring the kernel TLB flush,
> since it's the same mechanism that would be required for that.
>
>

There's at least one gotcha for the latter: NMIs aren't currently
guaranteed to go through context tracking.  Instead they use their own
RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
more care will be needed.  I would probably approach it with an
additional NMI hook in the same places as rcu_nmi_enter() that does,
more or less:

if (need_tlb_flush) flush();

and then make sure that the normal exit hook looks like:

if (need_tlb_flush) {
  flush();
  barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
been flushed */
  need_tlb_flush = false;
}

>>> But it does seem like we are adding noticeable maintenance cost on
>>> the mainline kernel to support task isolation by doing this.  My guess
>>> is that it is easier to support the kind of "are you clean?" / "get clean"
>>> APIs for subsystems, rather than weaving a whole set of "stay clean"
>>> mechanism into each subsystem.
>>
>> My intuition is that it's the other way around.  For the mainline
>> kernel, having a nice clean well-integrated implementation is nicer
>> than having a bolted-on implementation that interacts in potentially
>> complicated ways.  Once quiescence support is in mainline, the size of
>> the diff or the degree to which it's scattered around is irrelevant
>> because it's not a diff any more.
>
>
> I'm not concerned with the size of the diff, just with the intrusiveness into
> the various subsystems.
>
> That said, code talks, so let me take a swing at doing it the way you suggest
> for vmstat/lru and we'll see what it looks like.

Thanks :)

>
>
>>> So to pop up a level, what is your actual concern about the existing
>>> "do it in a loop" model?  The macrology currently in use means there
>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>> maintenance cost seems low since the idioms used for task isolation
>>> in the loop are generally familiar to people reading that code.
>>
>> My concern is that it's not obvious to readers of the code that the
>> loop ever terminates.  It really ought to, but it's doing something
>> very odd.  Normally we can loop because we get scheduled out, but
>> actually blocking in the return-to-userspace path, especially blocking
>> on a condition that doesn't have a wakeup associated with it, is odd.
>
>
> True, although, comments :-)
>
> Regardless, though, this doesn't seem at all weird to me in the
> context of the vmstat and lru stuff.  It's exactly parallel to
> the fact that we loop around on checking need_resched and signal, and
> in some cases you could imagine multiple loops around when we schedule
> out and get a signal, so loop around again, and then another
> reschedule event happens during signal processing so we go around
> again, etc.  Eventually it settles down.  It's the same with the
> vmstat/lru stuff.

Only kind of.

When we say, effectively, while (need_resched()) schedule();, we're
not waiting for an event or condition per se.  We're runnable (in the
sense that userspace wants to run and we're not blocked on anything)
the entire time -- we're simply yielding to some other thread that is
also runnable.  So if that loop runs forever, it either means that
we're at low priority and we genuinely shouldn't be running or that
there's a scheduler bug.

If, on the other hand, we say while (not quiesced) schedule(); (or
equivalent), we're saying that we're *not* really ready to run and
that we're waiting for some condition to change.  The condition in
question is fairly complicated and won't wake us when we are ready.  I
can also imagine the scheduler getting rather confused, since, as far
as the scheduler knows, we are runnable and we are supposed to be
running.

>
>
>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>> in question?
>>>
>>> We aren't currently planning to enforce things in the scheduler, so if
>>> the application affinitizes another task on top of an existing task
>>> isolation task, by default the task isolation task just dies. (Unless
>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>> kernel trying to wait out the dyntick until you either kill it, or
>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>> every possible way that you might (perhaps briefly) have some
>>> schedulable task, and the current approach seems pretty robust if that
>>> sort of thing happens.
>>
>> This kind of waiting out the dyntick scares me.  Why is there ever a
>> dyntick that you're waiting out?  If quiescence is to be a supported
>> mainline feature, shouldn't the scheduler be integrated well enough
>> with it that you don't need to wait like this?
>
>
> Well, this is certainly the funkiest piece of the task isolation
> stuff.  The problem is that the dyntick stuff may, for example, need
> one more tick 4us from now (or whatever) just to close out the current
> RCU period.  We can't return to userspace until that happens.  So what
> else can we do when the task is ready to return to userspace?  We
> could punt into the idle task instead of waiting in this task, which
> was my earlier schedule_time() suggestion.  Do you think that's cleaner?
>

Unless I'm missing something (which is reasonably likely), couldn't
the isolation code just force or require rcu_nocbs on the isolated
CPUs to avoid this problem entirely?

I admit I still don't understand why the RCU context tracking code
can't just run the callback right away instead of waiting however many
microseconds in general.  I feel like paulmck has explained it to me
at least once, but that doesn't mean I remember the answer.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
@ 2016-09-07 21:11   ` Francis Giraldeau
  2016-09-07 21:39     ` Francis Giraldeau
                       ` (2 more replies)
  2016-09-27 14:35   ` Frederic Weisbecker
  1 sibling, 3 replies; 80+ messages in thread
From: Francis Giraldeau @ 2016-09-07 21:11 UTC (permalink / raw)
  To: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 2016-08-29 12:27 PM, Chris Metcalf wrote:
> On 8/16/2016 5:19 PM, Chris Metcalf wrote:
>> Here is a respin of the task-isolation patch set.
>
> No concerns have been raised yet with the v15 version of the patch series
> in the two weeks since I posted it, and I think I have addressed all
> previously-raised concerns (or perhaps people have just given up arguing
> with me).

There is a header-include cycle in the v15 patch set on x86_64 that causes a compilation error. You will find the patch (and other fixes) in the following branch:

    https://github.com/giraldeau/linux/commits/dataplane-x86-fix-inc

The test file indicates that timer-tick.c should be changed to disable the 1Hz tick; is there a reason why that change is not in the patch set? I added this trivial change in the dataplane-x86-fix-inc branch above.

The syscall test fails on x86:

    $ sudo ./isolation
    [...]
    test_syscall: FAIL (0x100)
    test_syscall (SIGUSR1): FAIL (0x100)

I wanted to debug this problem with GDB and a KVM virtual machine. However, the TSC clock source is detected as non-reliable, even with the boot parameter tsc=reliable, and therefore prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) always returns EAGAIN. Is there a trick to running task isolation in a VM (at least for debugging purposes)?

BTW, this was causing the test to enter an infinite loop. If the clock source is not reliable, maybe a different error code should be returned, because this situation is not transient. In the meantime, I added a check for a maximum of 100 retries in the test (prctl_safe()).

When running only the test_jitter(), the isolation mode is lost:

    [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work

With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:

     kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler

The governor was ondemand, so I tried setting the frequency scaling governor to performance, but that does not solve the issue. Is there a way to suppress this irq_work? Should we run the isolated task with a high real-time priority, such that it never gets preempted?

Thanks!

Francis

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-07 21:11   ` Francis Giraldeau
@ 2016-09-07 21:39     ` Francis Giraldeau
  2016-09-08 16:21     ` Francis Giraldeau
  2016-09-12 16:01     ` Chris Metcalf
  2 siblings, 0 replies; 80+ messages in thread
From: Francis Giraldeau @ 2016-09-07 21:39 UTC (permalink / raw)
  To: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 2016-09-07 05:11 PM, Francis Giraldeau wrote:
> The syscall test fails on x86:
>     $ sudo ./isolation
>     [...]
>     test_syscall: FAIL (0x100)
>     test_syscall (SIGUSR1): FAIL (0x100)
>
> I wanted to debug this problem with GDB and a KVM virtual machine. However, the TSC clock source is detected as non-reliable, even with the boot parameter tsc=reliable, and therefore prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) always returns EAGAIN. Is there a trick to running task isolation in a VM (at least for debugging purposes)?

OK, got it. The guest kernel must be compiled with CONFIG_KVM_GUEST, and then with virsh edit, set the clock configuration of the VM (under <domain>):

  <clock offset='utc'>
    <timer name='kvmclock'/>
  </clock>

Of course, the jitter is horrible, but at least it is possible to debug with GDB.

Cheers,

Francis

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-07 21:11   ` Francis Giraldeau
  2016-09-07 21:39     ` Francis Giraldeau
@ 2016-09-08 16:21     ` Francis Giraldeau
  2016-09-12 16:01     ` Chris Metcalf
  2 siblings, 0 replies; 80+ messages in thread
From: Francis Giraldeau @ 2016-09-08 16:21 UTC (permalink / raw)
  To: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 2016-09-07 05:11 PM, Francis Giraldeau wrote:
> The syscall test fails on x86:
>     $ sudo ./isolation
>     [...]
>     test_syscall: FAIL (0x100)
>     test_syscall (SIGUSR1): FAIL (0x100)

The fix is indeed very simple:

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 7255367..449e2b5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -139,7 +139,7 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY        \
        (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
         _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |       \
-        _TIF_NOHZ)
+        _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK                                              \


I updated the branch accordingly:

https://github.com/giraldeau/linux/commit/057761af7bde087fa10aa42554a13a7b69071938

Thanks,

Francis

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 17:28                       ` Andy Lutomirski
@ 2016-09-09 17:40                         ` Chris Metcalf
  2016-09-12 17:41                           ` Andy Lutomirski
  2016-09-27 14:22                         ` Frederic Weisbecker
  1 sibling, 1 reply; 80+ messages in thread
From: Chris Metcalf @ 2016-09-09 17:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-doc, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Viresh Kumar, Linux API,
	Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>> What if we did it the other way around: set a percpu flag saying
>>>>> "going quiescent; disallow new deferred work", then finish all
>>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>>> etc.
>>>>
>>>> This is an interesting idea!
>>>>
>>>> However, there are a number of implementation issues that make me
>>>> worry that it might be a trickier approach overall.
>>>>
>>>> First, "on the next entry" hides a world of hurt in four simple words.
>>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>>> chunk of code that always runs on every entry to the kernel.  It would
>>>> not be too hard to poke at the assembly and make those platforms
>>>> always run some task-isolation specific code on entry.  But x86 scares
>>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>>> I'm not convinced there is a lot of shared macrology or whatever that
>>>> would make it straightforward to intercept all of them.
>>> Just use the context tracking entry hook.  It's 100% reliable.  The
>>> relevant x86 function is enter_from_user_mode(), but I would just hook
>>> into user_exit() in the common code.  (This code had better be
>>> reliable, because context tracking depends on it, and, if context
>>> tracking doesn't work on a given arch, then isolation isn't going to
>>> work regardless.)
>>
>> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
>> we could do an entry-point approach plausibly now.
>>
>> This is also good for when we want to look at deferring the kernel TLB flush,
>> since it's the same mechanism that would be required for that.
>>
>>
> There's at least one gotcha for the latter: NMIs aren't currently
> guaranteed to go through context tracking.  Instead they use their own
> RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
> more care will be needed.  I would probably approach it with an
> additional NMI hook in the same places as rcu_nmi_enter() that does,
> more or less:
>
> if (need_tlb_flush) flush();
>
> and then make sure that the normal exit hook looks like:
>
> if (need_tlb_flush) {
>    flush();
>    barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
> been flushed */
>    need_tlb_flush = false;
> }

This is a good point.  For now I will continue not trying to include the TLB flush
in the current patch series, so I will sit on this until we're ready to do so.

>>>> So to pop up a level, what is your actual concern about the existing
>>>> "do it in a loop" model?  The macrology currently in use means there
>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>> maintenance cost seems low since the idioms used for task isolation
>>>> in the loop are generally familiar to people reading that code.
>>> My concern is that it's not obvious to readers of the code that the
>>> loop ever terminates.  It really ought to, but it's doing something
>>> very odd.  Normally we can loop because we get scheduled out, but
>>> actually blocking in the return-to-userspace path, especially blocking
>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>
>> True, although, comments :-)
>>
>> Regardless, though, this doesn't seem at all weird to me in the
>> context of the vmstat and lru stuff.  It's exactly parallel to
>> the fact that we loop around on checking need_resched and signal, and
>> in some cases you could imagine multiple loops around when we schedule
>> out and get a signal, so loop around again, and then another
>> reschedule event happens during signal processing so we go around
>> again, etc.  Eventually it settles down.  It's the same with the
>> vmstat/lru stuff.
> Only kind of.
>
> When we say, effectively, while (need_resched()) schedule();, we're
> not waiting for an event or condition per se.  We're runnable (in the
> sense that userspace wants to run and we're not blocked on anything)
> the entire time -- we're simply yielding to some other thread that is
> also runnable.  So if that loop runs forever, it either means that
> we're at low priority and we genuinely shouldn't be running or that
> there's a scheduler bug.
>
> If, on the other hand, we say while (not quiesced) schedule(); (or
> equivalent), we're saying that we're *not* really ready to run and
> that we're waiting for some condition to change.  The condition in
> question is fairly complicated and won't wake us when we are ready.  I
> can also imagine the scheduler getting rather confused, since, as far
> as the scheduler knows, we are runnable and we are supposed to be
> running.

So, how about a code structure like this?

In the main return-to-userspace loop where we check TIF flags,
we keep the notion of our TIF_TASK_ISOLATION flag that causes
us to invoke a task_isolation_prepare() routine.  This routine
does the following things:

1. As you suggested, set a new TIF bit (or equivalent) that says the
system should no longer create deferred work on this core, and then
flush any necessary already-deferred work (currently, the LRU cache
and the vmstat stuff).  We never have to go flush the deferred work
again during this task's return to userspace.  Note that this bit can
only be set on a core marked for task isolation, so it can't be used
for denial of service type attacks on normal cores that are trying to
multitask normal Linux processes.

2. Check if the dyntick is stopped, and if not, wait on a completion
that will be set when it does stop.  This means we may schedule out at
this point, but when we return, the deferred work stuff is still safe
since your bit is still set, and in principle the dyntick is
stopped.

Then, after we disable interrupts and re-read the thread-info flags,
we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
set that would keep us in the loop.  This will always end up happening
on each return to userspace, since the only thing that actually clears
the bit is a prctl() call.  When that happens we know we are about to
return to userspace, so we call task_isolation_ready(), which now has
two things to do:

1. We check that the dyntick is in fact stopped, since it's possible
that a race condition led to it being somehow restarted by an interrupt.
If it is not stopped, we go around the loop again so we can go back in
to the completion discussed above and wait some more.  This may merit
a WARN_ON or other notice since it seems like people aren't convinced
there are things that could restart it, but honestly the dyntick stuff
is complex enough that I think a belt-and-suspenders kind of test here
at the last minute is just defensive programming.

2. Assuming it's stopped, we clear your bit at this point, and
return "true" so the loop code knows to break out of the loop and do
the actual return to userspace.  Clearing the bit at this point is
better than waiting until we re-enter the kernel later, since it
avoids having to figure out all the ways we actually can re-enter.
With interrupts disabled, and this late in the return to userspace
process, there's no way additional deferred work can be created.

>>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>>> in question?
>>>> We aren't currently planning to enforce things in the scheduler, so if
>>>> the application affinitizes another task on top of an existing task
>>>> isolation task, by default the task isolation task just dies. (Unless
>>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>>> kernel trying to wait out the dyntick until you either kill it, or
>>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>>> every possible way that you might (perhaps briefly) have some
>>>> schedulable task, and the current approach seems pretty robust if that
>>>> sort of thing happens.
>>> This kind of waiting out the dyntick scares me.  Why is there ever a
>>> dyntick that you're waiting out?  If quiescence is to be a supported
>>> mainline feature, shouldn't the scheduler be integrated well enough
>>> with it that you don't need to wait like this?
>>
>> Well, this is certainly the funkiest piece of the task isolation
>> stuff.  The problem is that the dyntick stuff may, for example, need
>> one more tick 4us from now (or whatever) just to close out the current
>> RCU period.  We can't return to userspace until that happens.  So what
>> else can we do when the task is ready to return to userspace?  We
>> could punt into the idle task instead of waiting in this task, which
>> was my earlier schedule_time() suggestion.  Do you think that's cleaner?
>>
> Unless I'm missing something (which is reasonably likely), couldn't
> the isolation code just force or require rcu_nocbs on the isolated
> CPUs to avoid this problem entirely?
>
> I admit I still don't understand why the RCU context tracking code
> can't just run the callback right away instead of waiting however many
> microseconds in general.  I feel like paulmck has explained it to me
> at least once, but that doesn't mean I remember the answer.

I admit I am not clear on this either.  However, since there are a
bunch of reasons why the dyntick might run (not just LRU), I think
fixing LRU may well not be enough to guarantee the dyntick
turns off exactly when we'd like it to.

And, with the structure proposed here, we can always come back
and revisit this by just removing the code that does the completion
waiting and replacing it with a call that just tells the dyntick to
just stop immediately, once we're confident we can make that work.

Then separately, we can also think about removing the code that
re-checks dyntick being stopped as we are about to return to
userspace with interrupts disabled, if we're convinced there's
also no way for the dyntick to get restarted due to an interrupt
being handled after we think the dyntick has been stopped.
I'd argue always leaving a WARN_ON() there would be good, though.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-07 21:11   ` Francis Giraldeau
  2016-09-07 21:39     ` Francis Giraldeau
  2016-09-08 16:21     ` Francis Giraldeau
@ 2016-09-12 16:01     ` Chris Metcalf
  2016-09-12 16:14       ` Peter Zijlstra
  2016-09-13  0:20       ` Francis Giraldeau
  2 siblings, 2 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-09-12 16:01 UTC (permalink / raw)
  To: Francis Giraldeau, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 9/7/2016 5:11 PM, Francis Giraldeau wrote:
> On 2016-08-29 12:27 PM, Chris Metcalf wrote:
>> On 8/16/2016 5:19 PM, Chris Metcalf wrote:
>>> Here is a respin of the task-isolation patch set.
>> No concerns have been raised yet with the v15 version of the patch series
>> in the two weeks since I posted it, and I think I have addressed all
>> previously-raised concerns (or perhaps people have just given up arguing
>> with me).
> There is a header-include cycle in the v15 patch set on x86_64 that causes a compilation error. You will find the patch (and other fixes) in the following branch:
>
>      https://github.com/giraldeau/linux/commits/dataplane-x86-fix-inc

Thanks, I fixed the header inclusion loop by converting
task_isolation_set_flags() to a macro, removing the unnecessary
include of <linux/smp.h>, and replacing the include of <linux/sched.h>
with a "struct task_struct;" declaration.  That avoids having to dump
too much isolation-related stuff into the apic.h header (note that
you'd also need to include the empty #define for when isolation is
configured off).

> The test file indicates that timer-tick.c should be changed to disable the 1Hz tick; is there a reason why that change is not in the patch set? I added this trivial change in the dataplane-x86-fix-inc branch above.

Yes, Frederic prefers that we not allow any way of automatically
disabling the tick for now.  Hopefully we will clean up the last
few things that are requiring it to keep ticking shortly.  But for
now it's on a parallel track to the task isolation stuff.

> The syscall test fails on x86:
>
>      $ sudo ./isolation
>      [...]
>      test_syscall: FAIL (0x100)
>      test_syscall (SIGUSR1): FAIL (0x100)

Your next email suggested adding TIF_TASK_ISOLATION to the set of
flags in _TIF_WORK_SYSCALL_ENTRY.  I'm happy to make this change
regardless (it's consistent with Andy's request to add the task
isolation flag to _TIF_ALLWORK_MASK), but I'm puzzled: as far as
I know there is no way for TIF_TASK_ISOLATION to be set unless
TIF_NOHZ is also set.  The context_tracking_init() code forces TIF_NOHZ
on for every task during boot up, and nothing ever clears it, so...

> I wanted to debug this problem with GDB and a KVM virtual machine. However, the TSC clock source is detected as non-reliable, even with the boot parameter tsc=reliable, and therefore prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) always returns EAGAIN. Is there a trick to running task isolation in a VM (at least for debugging purposes)?
>
> BTW, this was causing the test to enter an infinite loop. If the clock source is not reliable, maybe a different error code should be returned, because this situation is not transient.

That's a good idea - do you know what the check should be in that
case?  We can just return EINVAL, as you suggest.

> In the meantime, I added a check for a maximum of 100 retries in the test (prctl_safe()).

Thanks, that's a good idea.  I'll add your changes to the selftest code for the
next release.

> When running only the test_jitter(), the isolation mode is lost:
>
>      [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work
>
> With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:
>
>       kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler
>
> The governor was ondemand, so I tried setting the frequency scaling governor to performance, but that does not solve the issue. Is there a way to suppress this irq_work? Should we run the isolated task with a high real-time priority, such that it never gets preempted?

On the tile platform we don't have the frequency scaling stuff to contend with, so
I don't know much about it.  I'd be very curious to know what you can figure out
on this front.

Thanks a lot for your help!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-12 16:01     ` Chris Metcalf
@ 2016-09-12 16:14       ` Peter Zijlstra
  2016-09-12 21:15         ` Rafael J. Wysocki
  2016-09-13  0:20       ` Francis Giraldeau
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-09-12 16:14 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Francis Giraldeau, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel,
	Rafael J. Wysocki

On Mon, Sep 12, 2016 at 12:01:58PM -0400, Chris Metcalf wrote:
> On 9/7/2016 5:11 PM, Francis Giraldeau wrote:
> >When running only the test_jitter(), the isolation mode is lost:
> >
> >     [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work
> >
> >With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:
> >
> >      kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler
> >
> >The governor was ondemand, so I tried to set the frequency scaling
> >governor to performance, but that does not solve the issue. Is there
> >a way to suppress this irq_work? Should we run the isolated task with
> >high real-time priority, such that it never gets preempted?
> 
> On the tile platform we don't have the frequency scaling stuff to contend with, so
> I don't know much about it.  I'd be very curious to know what you can figure out
> on this front.

Rafael, I'm thinking the performance governor should be able to run
without sending IPIs. Is there anything we can quickly do about that?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-09 17:40                         ` Chris Metcalf
@ 2016-09-12 17:41                           ` Andy Lutomirski
  2016-09-12 19:25                             ` Chris Metcalf
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-09-12 17:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>
>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>
>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>
>>>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> What if we did it the other way around: set a percpu flag saying
>>>>>> "going quiescent; disallow new deferred work", then finish all
>>>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>>>> etc.
>>>>>
>>>>>
>>>>> This is an interesting idea!
>>>>>
>>>>> However, there are a number of implementation issues that make me
>>>>> worry that it might be a trickier approach overall.
>>>>>
>>>>> First, "on the next entry" hides a world of hurt in four simple words.
>>>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>>>> chunk of code that always runs on every entry to the kernel.  It would
>>>>> not be too hard to poke at the assembly and make those platforms
>>>>> always run some task-isolation specific code on entry.  But x86 scares
>>>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>>>> I'm not convinced there is a lot of shared macrology or whatever that
>>>>> would make it straightforward to intercept all of them.
>>>>
>>>> Just use the context tracking entry hook.  It's 100% reliable.  The
>>>> relevant x86 function is enter_from_user_mode(), but I would just hook
>>>> into user_exit() in the common code.  (This code had better be
>>>> reliable, because context tracking depends on it, and, if context
>>>> tracking doesn't work on a given arch, then isolation isn't going to
>>>> work regardless.)
>>>
>>>
>>> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
>>> we could do an entry-point approach plausibly now.
>>>
>>> This is also good for when we want to look at deferring the kernel TLB flush,
>>> since it's the same mechanism that would be required for that.
>>>
>>>
>> There's at least one gotcha for the latter: NMIs aren't currently
>> guaranteed to go through context tracking.  Instead they use their own
>> RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
>> more care will be needed.  I would probably approach it with an
>> additional NMI hook in the same places as rcu_nmi_enter() that does,
>> more or less:
>>
>> if (need_tlb_flush) flush();
>>
>> and then make sure that the normal exit hook looks like:
>>
>> if (need_tlb_flush) {
>>    flush();
>>    barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't been flushed */
>>    flush the TLB;
>> }
>
>
> This is a good point.  For now I will continue not trying to include the TLB flush
> in the current patch series, so I will sit on this until we're ready to do so.
>
>
>>>>> So to pop up a level, what is your actual concern about the existing
>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>> in the loop are generally familiar to people reading that code.
>>>>
>>>> My concern is that it's not obvious to readers of the code that the
>>>> loop ever terminates.  It really ought to, but it's doing something
>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>> actually blocking in the return-to-userspace path, especially blocking
>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>
>>>
>>> True, although, comments :-)
>>>
>>> Regardless, though, this doesn't seem at all weird to me in the
>>> context of the vmstat and lru stuff.  It's exactly parallel to
>>> the fact that we loop around on checking need_resched and signal, and
>>> in some cases you could imagine multiple loops around when we schedule
>>> out and get a signal, so loop around again, and then another
>>> reschedule event happens during signal processing so we go around
>>> again, etc.  Eventually it settles down.  It's the same with the
>>> vmstat/lru stuff.
>>
>> Only kind of.
>>
>> When we say, effectively, while (need_resched()) schedule();, we're
>> not waiting for an event or condition per se.  We're runnable (in the
>> sense that userspace wants to run and we're not blocked on anything)
>> the entire time -- we're simply yielding to some other thread that is
>> also runnable.  So if that loop runs forever, it either means that
>> we're at low priority and we genuinely shouldn't be running or that
>> there's a scheduler bug.
>>
>> If, on the other hand, we say while (not quiesced) schedule(); (or
>> equivalent), we're saying that we're *not* really ready to run and
>> that we're waiting for some condition to change.  The condition in
>> question is fairly complicated and won't wake us when we are ready.  I
>> can also imagine the scheduler getting rather confused, since, as far
>> as the scheduler knows, we are runnable and we are supposed to be
>> running.
>
>
> So, how about a code structure like this?
>
> In the main return-to-userspace loop where we check TIF flags,
> we keep the notion of our TIF_TASK_ISOLATION flag that causes
> us to invoke a task_isolation_prepare() routine.  This routine
> does the following things:
>
> 1. As you suggested, set a new TIF bit (or equivalent) that says the
> system should no longer create deferred work on this core, and then
> flush any necessary already-deferred work (currently, the LRU cache
> and the vmstat stuff).  We never have to go flush the deferred work
> again during this task's return to userspace.  Note that this bit can
> only be set on a core marked for task isolation, so it can't be used
> for denial of service type attacks on normal cores that are trying to
> multitask normal Linux processes.

I think it can't be a TIF flag unless you can do the whole mess with
preemption off because, if you get preempted, other tasks on the CPU
won't see the flag.  You could do it with a percpu flag, I think.

>
> 2. Check if the dyntick is stopped, and if not, wait on a completion
> that will be set when it does stop.  This means we may schedule out at
> this point, but when we return, the deferred work stuff is still safe
> since your bit is still set, and in principle the dyn tick is
> stopped.
>
> Then, after we disable interrupts and re-read the thread-info flags,
> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
> set that would keep us in the loop.  This will always end up happening
> on each return to userspace, since the only thing that actually clears
> the bit is a prctl() call.  When that happens we know we are about to
> return to userspace, so we call task_isolation_ready(), which now has
> two things to do:

Would it perhaps be more straightforward to do the stuff before the
loop and not check TIF_TASK_ISOLATION in the loop?

>
> 1. We check that the dyntick is in fact stopped, since it's possible
> that a race condition led to it being somehow restarted by an interrupt.
> If it is not stopped, we go around the loop again so we can go back in
> to the completion discussed above and wait some more.  This may merit
> a WARN_ON or other notice since it seems like people aren't convinced
> there are things that could restart it, but honestly the dyntick stuff
> is complex enough that I think a belt-and-suspenders kind of test here
> at the last minute is just defensive programming.

Seems reasonable.  But maybe this could go after the loop and, if the
dyntick is back, it could be treated like any other kernel bug that
interrupts an isolated task?  That would preserve more of the existing
code structure.

If that works, it could go in user_enter().

>
> 2. Assuming it's stopped, we clear your bit at this point, and
> return "true" so the loop code knows to break out of the loop and do
> the actual return to userspace.  Clearing the bit at this point is
> better than waiting until we re-enter the kernel later, since it
> avoids having to figure out all the ways we actually can re-enter.
> With interrupts disabled, and this late in the return to userspace
> process, there's no way additional deferred work can be created.
>
>
>>>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>>>> in question?
>>>>>
>>>>> We aren't currently planning to enforce things in the scheduler, so if
>>>>> the application affinitizes another task on top of an existing task
>>>>> isolation task, by default the task isolation task just dies. (Unless
>>>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>>>> kernel trying to wait out the dyntick until you either kill it, or
>>>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>>>> every possible way that you might (perhaps briefly) have some
>>>>> schedulable task, and the current approach seems pretty robust if that
>>>>> sort of thing happens.
>>>>
>>>> This kind of waiting out the dyntick scares me.  Why is there ever a
>>>> dyntick that you're waiting out?  If quiescence is to be a supported
>>>> mainline feature, shouldn't the scheduler be integrated well enough
>>>> with it that you don't need to wait like this?
>>>
>>>
>>> Well, this is certainly the funkiest piece of the task isolation
>>> stuff.  The problem is that the dyntick stuff may, for example, need
>>> one more tick 4us from now (or whatever) just to close out the current
>>> RCU period.  We can't return to userspace until that happens.  So what
>>> else can we do when the task is ready to return to userspace?  We
>>> could punt into the idle task instead of waiting in this task, which
>>> was my earlier schedule_time() suggestion.  Do you think that's cleaner?
>>>
>> Unless I'm missing something (which is reasonably likely), couldn't
>> the isolation code just force or require rcu_nocbs on the isolated
>> CPUs to avoid this problem entirely.
>>
>> I admit I still don't understand why the RCU context tracking code
>> can't just run the callback right away instead of waiting however many
>> microseconds in general.  I feel like paulmck has explained it to me
>> at least once, but that doesn't mean I remember the answer.
>
>
> I admit I am not clear on this either.  However, since there are a
> bunch of reasons why the dyntick might run (not just LRU), I think
> fixing LRU may well not be enough to guarantee the dyntick
> turns off exactly when we'd like it to.
>
> And, with the structure proposed here, we can always come back
> and revisit this by just removing the code that does the completion
> waiting and replacing it with a call that just tells the dyntick to
> just stop immediately, once we're confident we can make that work.
>
> Then separately, we can also think about removing the code that
> re-checks dyntick being stopped as we are about to return to
> userspace with interrupts disabled, if we're convinced there's
> also no way for the dyntick to get restarted due to an interrupt
> being handled after we think the dyntick has been stopped.
> I'd argue always leaving a WARN_ON() there would be good, though.
>
>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-12 17:41                           ` Andy Lutomirski
@ 2016-09-12 19:25                             ` Chris Metcalf
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-09-12 19:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates.  It really ought to, but it's doing something
>>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although, comments :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff.  It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc.  Eventually it settles down.  It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se.  We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable.  So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change.  The condition in
>>> question is fairly complicated and won't wake us when we are ready.  I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine.  This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff).  We never have to go flush the deferred work
>> again during this task's return to userspace.  Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag.  You could do it with a percpu flag, I think.

Yes, a percpu flag - you're right.  I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
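The per-cpu "quiescent" flag being agreed on here can be sketched in userspace C as a toy model (all names are invented for illustration; in the kernel this would be a true per-cpu variable in linux/isolation.h, not a single atomic):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu flag: deferred-work producers
 * (vmstat, LRU draining, ...) would check it and do their work
 * synchronously instead of deferring it. */
static atomic_bool cpu_quiescent;

/* Set on the return-to-userspace path of an isolated task. */
static void task_isolation_set_quiescent(void)
{
	atomic_store_explicit(&cpu_quiescent, true, memory_order_release);
}

/* Cleared once the task has safely returned to userspace (or on the
 * next kernel entry, depending on which design is chosen). */
static void task_isolation_clear_quiescent(void)
{
	atomic_store_explicit(&cpu_quiescent, false, memory_order_relaxed);
}

/* Producers consult this before queueing new deferred work. */
static bool cpu_wants_no_deferred_work(void)
{
	return atomic_load_explicit(&cpu_quiescent, memory_order_acquire);
}
```

The release/acquire pairing stands in for whatever ordering the real kernel code would need against interrupt and preemption contexts.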

>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop.  This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyn tick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop.  This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call.  When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?

We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task.  We'd
clearly like that other task to run before the isolated task returns to
userspace.  But then, that other task might do something to re-enable
the dyntick.  That's why we'd like to recheck that dyntick is off in
the loop after each potential call to schedule().
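The loop structure described above can be modeled as a toy sketch; every helper here is a made-up stand-in for the real kernel code (need_resched/signal handling, tick_nohz_tick_stopped(), the completion wait):

```c
#include <assert.h>
#include <stdbool.h>

static int pending_work = 2;	/* pretend two rounds of deferred work */
static bool tick_stopped;
static int tick_waits;

static bool handle_pending_work(void)	/* resched, signals, ... */
{
	return pending_work-- > 0;
}

static bool tick_stopped_this_cpu(void)
{
	return tick_stopped;
}

static void wait_for_tick_stop(void)	/* may schedule() in reality */
{
	tick_waits++;
	tick_stopped = true;
}

/* Returns the number of trips around the loop before it was safe to
 * return to userspace. */
static int exit_to_usermode_loop_sketch(bool task_isolation)
{
	int trips = 0;

	for (;;) {
		trips++;
		if (handle_pending_work())
			continue;	/* did some work; recheck everything */
		if (task_isolation && !tick_stopped_this_cpu()) {
			wait_for_tick_stop();
			continue;	/* scheduling may restart the tick */
		}
		break;			/* quiesced: OK to return */
	}
	return trips;
}
```

The second `continue` is the point at issue: after any potential schedule(), the dyntick state is rechecked before the loop is allowed to exit.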

>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more.  This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable.  But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task?  That would preserve more of the existing
> code structure.

Well, we can certainly try it that way.  If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK.  But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.

> If that works, it could go in user_enter().

Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-12 16:14       ` Peter Zijlstra
@ 2016-09-12 21:15         ` Rafael J. Wysocki
  2016-09-13  0:05           ` Rafael J. Wysocki
  0 siblings, 1 reply; 80+ messages in thread
From: Rafael J. Wysocki @ 2016-09-12 21:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Francis Giraldeau, Gilad Ben Yossef,
	Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Monday, September 12, 2016 06:14:44 PM Peter Zijlstra wrote:
> On Mon, Sep 12, 2016 at 12:01:58PM -0400, Chris Metcalf wrote:
> > On 9/7/2016 5:11 PM, Francis Giraldeau wrote:
> > >When running only the test_jitter(), the isolation mode is lost:
> > >
> > >     [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work
> > >
> > >With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:
> > >
> > >      kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler
> > >
> > >The governor was ondemand, so I tried to set the frequency scaling
> > >governor to performance, but that does not solve the issue. Is there
> > >a way to suppress this irq_work? Should we run the isolated task with
> > >high real-time priority, such that it never gets preempted?
> > 
> > On the tile platform we don't have the frequency scaling stuff to contend with, so
> > I don't know much about it.  I'd be very curious to know what you can figure out
> > on this front.
> 
> Rafael, I'm thinking the performance governor should be able to run
> without sending IPIs. Is there anything we can quickly do about that?

The performance governor doesn't do any IPIs.

At this point I'm not sure what's going on.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-12 21:15         ` Rafael J. Wysocki
@ 2016-09-13  0:05           ` Rafael J. Wysocki
  2016-09-13 16:00             ` Francis Giraldeau
  0 siblings, 1 reply; 80+ messages in thread
From: Rafael J. Wysocki @ 2016-09-13  0:05 UTC (permalink / raw)
  To: Peter Zijlstra, Francis Giraldeau
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On Monday, September 12, 2016 11:15:45 PM Rafael J. Wysocki wrote:
> On Monday, September 12, 2016 06:14:44 PM Peter Zijlstra wrote:
> > On Mon, Sep 12, 2016 at 12:01:58PM -0400, Chris Metcalf wrote:
> > > On 9/7/2016 5:11 PM, Francis Giraldeau wrote:
> > > >When running only the test_jitter(), the isolation mode is lost:
> > > >
> > > >     [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work
> > > >
> > > >With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:
> > > >
> > > >      kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler
> > > >
> > > >The governor was ondemand, so I tried to set the frequency scaling
> > > >governor to performance, but that does not solve the issue. Is there
> > > >a way to suppress this irq_work? Should we run the isolated task with
> > > >high real-time priority, such that it never gets preempted?
> > > 
> > > On the tile platform we don't have the frequency scaling stuff to contend with, so
> > > I don't know much about it.  I'd be very curious to know what you can figure out
> > > on this front.
> > 
> > Rafael, I'm thinking the performance governor should be able to run
> > without sending IPIs. Is there anything we can quickly do about that?
> 
> The performance governor doesn't do any IPIs.
> 
> At this point I'm not sure what's going on.

I've just tried and switching over to the performance governor makes
dbs_work_handler go away for me (w/ -rc4 with some extra irrelevant patches
on top) as it should.

What doesn't work, in particular?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-12 16:01     ` Chris Metcalf
  2016-09-12 16:14       ` Peter Zijlstra
@ 2016-09-13  0:20       ` Francis Giraldeau
  2016-09-13 16:12         ` Chris Metcalf
  2016-09-27 14:49         ` Frederic Weisbecker
  1 sibling, 2 replies; 80+ messages in thread
From: Francis Giraldeau @ 2016-09-13  0:20 UTC (permalink / raw)
  To: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

On 2016-09-12 12:01 PM, Chris Metcalf wrote:
> The syscall test fails on x86:
>>
>>      $ sudo ./isolation
>>      [...]
>>      test_syscall: FAIL (0x100)
>>      test_syscall (SIGUSR1): FAIL (0x100) 
>
> Your next email suggested adding TIF_TASK_ISOLATION to the set of
> flags in _TIF_WORK_SYSCALL_ENTRY.  I'm happy to make this change
> regardless (it's consistent with Andy's request to add the task
> isolation flag to _TIF_ALLWORK_MASK), but I'm puzzled: as far as
> I know there is no way for TIF_TASK_ISOLATION to be set unless
> TIF_NOHZ is also set.  The context_tracking_init() code forces TIF_NOHZ
> on for every task during boot up, and nothing ever clears it, so...
>

Hello!

You are right: on entry to syscall_trace_enter() the flags are
(_TIF_NOHZ | _TIF_TASK_ISOLATION):

    [   22.634988] isolation thread flags: 0x82000

But at linux/arch/x86/entry/common.c:83

    work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

the flag _TIF_TASK_ISOLATION was cleared because it is not included in
_TIF_WORK_SYSCALL_ENTRY. Then, the test below is always false:

    if (work & _TIF_TASK_ISOLATION) {
        if (task_isolation_syscall(regs->orig_ax) == -1)
            return -1L;
        work &= ~_TIF_TASK_ISOLATION;
    }

To fix the issue, _TIF_TASK_ISOLATION must be in _TIF_WORK_SYSCALL_ENTRY.
It works on arm64 because the flags are used directly without a mask applied.
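The masking bug can be reproduced in miniature (bit positions chosen to match the 0x82000 value printed above; only the masking logic mirrors arch/x86/entry/common.c):

```c
#include <assert.h>

#define _TIF_NOHZ		(1UL << 19)	/* 0x80000 */
#define _TIF_TASK_ISOLATION	(1UL << 13)	/* 0x02000 */

/* Before the fix: the isolation bit is missing from the entry-work mask. */
#define _TIF_WORK_SYSCALL_ENTRY_OLD	(_TIF_NOHZ)
/* After the fix: include it so the task_isolation_syscall() check fires. */
#define _TIF_WORK_SYSCALL_ENTRY_NEW	(_TIF_NOHZ | _TIF_TASK_ISOLATION)

static unsigned long entry_work(unsigned long ti_flags, unsigned long mask)
{
	return ti_flags & mask;	/* work = ACCESS_ONCE(ti->flags) & mask */
}
```

With the old mask, `entry_work()` drops _TIF_TASK_ISOLATION before the `if (work & _TIF_TASK_ISOLATION)` test, so the branch can never be taken.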

>> BTW, this was causing the test to enter an infinite loop. If the clock
>> source is not reliable, maybe a different error code should be returned,
>> because this situation is not transient.
>
> That's a good idea - do you know what the check should be in that
> case?  We can just return EINVAL, as you suggest. 

The args are valid, but the system has an unstable clock, so the
operation is not supported.  From the user's point of view, maybe ENOTSUPP
would be more appropriate?  But then we would need to check the reason, and
can_stop_my_full_tick() returns only a boolean.

On a side note, the NOSIG mode may be confusing for users.  At first,
I was expecting that NOSIG would behave the same way as the normal task
isolation mode.  In the current situation, if the user wants the normal
behavior but does not care about the signal, they must register an empty
signal handler.

However, if I understand correctly, other settings besides NOHZ and isolcpus
are required to support quiet CPUs, such as irq_affinity and rcu_nocb.  It would
be very convenient from the user's point of view if these other settings were
configured correctly.

I can work on that and also write some doc (Documentation/task-isolation.txt ?).

> Thanks a lot for your help! 

Many thanks for your feedback,

Francis

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-13  0:05           ` Rafael J. Wysocki
@ 2016-09-13 16:00             ` Francis Giraldeau
  0 siblings, 0 replies; 80+ messages in thread
From: Francis Giraldeau @ 2016-09-13 16:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On 2016-09-12 08:05 PM, Rafael J. Wysocki wrote:
> On Monday, September 12, 2016 11:15:45 PM Rafael J. Wysocki wrote:
>> On Monday, September 12, 2016 06:14:44 PM Peter Zijlstra wrote:
>>> On Mon, Sep 12, 2016 at 12:01:58PM -0400, Chris Metcalf wrote:
>>>> On 9/7/2016 5:11 PM, Francis Giraldeau wrote:
>>>>> When running only the test_jitter(), the isolation mode is lost:
>>>>>
>>>>>     [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work
>>>>>
>>>>> With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:
>>>>>
>>>>>      kworker/1:1-676   [001] ....  6610.097128: workqueue_execute_start: work struct ffff8801a784ca20: function dbs_work_handler
>>>>>
>>>>> The governor was ondemand, so I tried to set the frequency scaling
>>>>> governor to performance, but that does not solve the issue.
>>>
>>> Rafael, I'm thinking the performance governor should be able to run
>>> without sending IPIs. Is there anything we can quickly do about that?
>>
>> The performance governor doesn't do any IPIs.
>>
>> At this point I'm not sure what's going on.
> 
> I've just tried and switching over to the performance governor makes
> dbs_work_handler go away for me (w/ -rc4 with some extra irrelevant patches
> on top) as it should.

Oh gosh, the command "cpufreq-set -g performance" sets the governor only
for cpu0.  I was expecting it to set it for all cpus.  The isolation test runs
on cpu1, and that cpu was still using the ondemand governor.  With the
performance governor, dbs_work_handler no longer occurs and the isolated
task is no longer preempted by that kworker thread.

Sorry for the noise and thanks for checking!

Francis

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-13  0:20       ` Francis Giraldeau
@ 2016-09-13 16:12         ` Chris Metcalf
  2016-09-27 14:49         ` Frederic Weisbecker
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-09-13 16:12 UTC (permalink / raw)
  To: Francis Giraldeau, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

Thanks for your explanation of the TIF_TASK_ISOLATION
flag being needed for x86  _TIF_WORK_SYSCALL_ENTRY.
It makes perfect sense in retrospect :-)

On 9/12/2016 8:20 PM, Francis Giraldeau wrote:
> On a side note, the NOSIG mode may be confusing for users.  At first,
> I was expecting that NOSIG would behave the same way as the normal task
> isolation mode.  In the current situation, if the user wants the normal
> behavior but does not care about the signal, they must register an empty signal handler.

So, "normal behavior" isn't really well defined once we start
allowing empty signal handlers.  In particular, task isolation will
be turned off before invoking your signal handler, and if the
handler is empty, you just lose isolation and that's that.  By
contrast, the NOSIG mode will try to keep you isolated.

I'm definitely open to suggestions about how to frame the API
for NOSIG or equivalent modes.  What were you expecting to
be able to do by suppressing the signal, and how is NOSIG not
the thing you wanted?

> However, if I understand correctly, other settings besides NOHZ and isolcpus
> are required to support quiet CPUs, such as irq_affinity and rcu_nocb.  It would
> be very convenient from the user's point of view if these other settings were
> configured correctly.

I think it makes sense to move towards a mode where enabling
task_isolation sets up the rcu_nocbs and irq_affinity automatically,
rather than requiring users to understand all the fiddly configuration
and boot argument details.

> I can work on that and also write some doc (Documentation/task-isolation.txt ?).

Sure, documentation is always welcome!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 17:28                       ` Andy Lutomirski
  2016-09-09 17:40                         ` Chris Metcalf
@ 2016-09-27 14:22                         ` Frederic Weisbecker
  2016-09-27 14:39                           ` Peter Zijlstra
  2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 2 replies; 80+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Thomas Gleixner, linux-doc, Christoph Lameter,
	Michal Hocko, Gilad Ben Yossef, Andrew Morton, Viresh Kumar,
	Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Paul E. McKenney, linux-mm, linux-kernel,
	Catalin Marinas, Peter Zijlstra

On Fri, Sep 02, 2016 at 10:28:00AM -0700, Andy Lutomirski wrote:
> 
> Unless I'm missing something (which is reasonably likely), couldn't
> the isolation code just force or require rcu_nocbs on the isolated
> CPUs to avoid this problem entirely.

rcu_nocb is already implied by nohz_full, which means that RCU callbacks
are offloaded outside the nohz_full set of CPUs.

> 
> I admit I still don't understand why the RCU context tracking code
> can't just run the callback right away instead of waiting however many
> microseconds in general.  I feel like paulmck has explained it to me
> at least once, but that doesn't mean I remember the answer.

The RCU context tracking doesn't take care of callbacks.  It's only there
to tell the RCU core whether the CPU runs code that may or may not contain
RCU read-side critical sections.  The assumption is "kernel may use RCU,
userspace can't".

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
  2016-09-07 21:11   ` Francis Giraldeau
@ 2016-09-27 14:35   ` Frederic Weisbecker
  2016-09-30 17:07     ` Chris Metcalf
  1 sibling, 1 reply; 80+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:35 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:27:06PM -0400, Chris Metcalf wrote:
> On 8/16/2016 5:19 PM, Chris Metcalf wrote:
> >Here is a respin of the task-isolation patch set.
> >
> >Again, I have been getting email asking me when and where this patch
> >will be upstreamed so folks can start using it.  I had been thinking
> >the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
> >thing.  But perhaps it touches enough other subsystems that that
> >doesn't really make sense?  Andrew, would it make sense to take it
> >directly via your tree?  Frederic, Ingo, what do you think?
> 
> Ping!
> 
> No concerns have been raised yet with the v15 version of the patch series
> in the two weeks since I posted it, and I think I have addressed all
> previously-raised concerns (or perhaps people have just given up arguing
> with me).
> 
> I did add Catalin's Reviewed-by to 08/13 (thanks!) and updated my
> kernel.org repo.
> 
> Does this feel like something we can merge when the 4.9 merge window opens?
> If so, whose tree is best suited for it?  Or should I ask Stephen to put it into
> linux-next now and then ask Linus to merge it directly?  I recall Ingo thought
> this was a bad idea when I suggested it back in January, but I'm not sure where
> we got to in terms of a better approach.

As it seems we are still debating a lot of things in this patchset, which has already
reached v15, I think you should split it into smaller steps in order to move forward,
and only get to the next step once the previous one is merged.

You could start with a first batch that introduces the prctl() and does the best-effort
one-shot isolation part, i.e. the actions that only need to be performed once, at
prctl() time.

Once we get that merged, we can focus on what needs to be performed on every return
to userspace, if that's really needed, including possibly waiting on some completion.

Then, once we rip out the residual 1 Hz tick, we can start to think about signaling the
user on any interruption, etc...

Does that sound feasible to you?

Thanks.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:22                         ` Frederic Weisbecker
@ 2016-09-27 14:39                           ` Peter Zijlstra
  2016-09-27 14:51                             ` Frederic Weisbecker
  2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2016-09-27 14:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas

On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:

> The RCU context tracking doesn't take care of callbacks. It's only there
> to tell the RCU core whether the CPU runs code that may or may not run
> RCU read-side critical sections. The assumption is "kernel may use RCU,
> userspace can't".

Userspace can never use the kernel's RCU in any case. What you mean to
say is that userspace is treated like an idle CPU, in that the CPU will
no longer be part of the RCU quiescent state machine.

The transition to userspace (as per context tracking) must ensure that
the CPU's RCU state is 'complete', just like our transition to idle (mostly)
does.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:22                         ` Frederic Weisbecker
  2016-09-27 14:39                           ` Peter Zijlstra
@ 2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 0 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-09-27 14:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, linux-mm, linux-kernel,
	Catalin Marinas, Peter Zijlstra

On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:
> On Fri, Sep 02, 2016 at 10:28:00AM -0700, Andy Lutomirski wrote:
> > 
> > Unless I'm missing something (which is reasonably likely), couldn't
> > the isolation code just force or require rcu_nocbs on the isolated
> > CPUs to avoid this problem entirely.
> 
> rcu_nocb is already implied by nohz_full, which means that RCU callbacks
> are offloaded to kthreads running outside the nohz_full set of CPUs.

Indeed, at boot time, RCU makes any nohz_full CPU also be a rcu_nocb
CPU.

> > I admit I still don't understand why the RCU context tracking code
> > can't just run the callback right away instead of waiting however many
> > microseconds in general.  I feel like paulmck has explained it to me
> > at least once, but that doesn't mean I remember the answer.
> 
> The RCU context tracking doesn't take care of callbacks. It's only there
> to tell the RCU core whether the CPU runs code that may or may not run
> RCU read-side critical sections. The assumption is "kernel may use RCU,
> userspace can't".

And RCU has to wait for read-side critical sections to complete before
invoking callbacks.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-13  0:20       ` Francis Giraldeau
  2016-09-13 16:12         ` Chris Metcalf
@ 2016-09-27 14:49         ` Frederic Weisbecker
  1 sibling, 0 replies; 80+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:49 UTC (permalink / raw)
  To: Francis Giraldeau
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

On Mon, Sep 12, 2016 at 08:20:16PM -0400, Francis Giraldeau wrote:
> 
> The args are valid, but the system has an unstable clock, therefore the
> operation is not supported. From the user's point of view, maybe ENOTSUPP
> would be more appropriate? But then we would need to check the reason, and
> can_stop_my_full_tick() returns only a boolean.
> 
> On a side note, the NOSIG mode may be confusing for users. At first,
> I was expecting NOSIG to behave the same way as the normal task-isolation
> mode. As it stands, if the user wants the normal behavior but does not
> care about the signal, they must register an empty signal handler.
> 
> However, if I understand correctly, other settings besides NOHZ and isolcpus
> are required to support quiet CPUs, such as irq_affinity and rcu_nocb. It would
> be very convenient from the user's point of view if these other settings were
> configured correctly.
> 
> I can work on that and also write some doc (Documentation/task-isolation.txt ?).

That would be lovely! Part of this documentation already exists in
Documentation/timers/NO_HZ.txt and also in Documentation/kernel-per-CPU-kthreads.txt

I think we should extract the isolation information that isn't related to the tick
from NO_HZ.txt and put it in task-isolation.txt, and perhaps merge kernel-per-CPU-kthreads.txt
into it or at least add a pointer to it. Then add all the missing information, as many things
have evolved since then.

I'll gladly help.

Thanks.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:39                           ` Peter Zijlstra
@ 2016-09-27 14:51                             ` Frederic Weisbecker
  0 siblings, 0 replies; 80+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas

On Tue, Sep 27, 2016 at 04:39:26PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:
> 
> > The RCU context tracking doesn't take care of callbacks. It's only there
> > to tell the RCU core whether the CPU runs code that may or may not run
> > RCU read-side critical sections. The assumption is "kernel may use RCU,
> > userspace can't".
> 
> Userspace can never use the kernel's RCU in any case. What you mean to
> say is that userspace is treated like an idle CPU, in that the CPU will
> no longer be part of the RCU quiescent state machine.
> 
> The transition to userspace (as per context tracking) must ensure that
> the CPU's RCU state is 'complete', just like our transition to idle (mostly)
> does.

Exactly!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode)
  2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
  2016-08-20  1:42   ` Chris Metcalf
@ 2016-09-28 13:16   ` Frederic Weisbecker
  1 sibling, 0 replies; 80+ messages in thread
From: Frederic Weisbecker @ 2016-09-28 13:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, Francis Giraldeau,
	linux-doc, linux-api, linux-kernel

On Wed, Aug 17, 2016 at 02:37:46PM -0500, Christoph Lameter wrote:
> On Tue, 16 Aug 2016, Chris Metcalf wrote:
> Subject: NOHZ: Correctly display increasing cputime when processor is busy
> 
> The tick may be switched off when the processor gets busy with nohz full.
> The user time fields in /proc/stat will then no longer increase because
> the tick is not run to update the cpustat values anymore.
> 
> Compensate for the missing ticks by checking if a processor is in
> such a mode. If so then add the ticks that have passed since
> the tick was switched off to the usertime.
> 
> Note that this introduces a slight inaccuracy. The process may
> actually do syscalls without triggering a tick again but the
> processing time in those calls is negligible. Any wait or sleep
> occurrence during syscalls would activate the tick again.
> 
> Any inaccuracy is corrected once the tick is switched on again,
> since the actual value into which cputime is aggregated is not changed.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/fs/proc/stat.c
> ===================================================================
> --- linux.orig/fs/proc/stat.c	2016-08-04 09:04:57.681480937 -0500
> +++ linux/fs/proc/stat.c	2016-08-17 14:27:37.813445675 -0500
> @@ -77,6 +77,12 @@ static u64 get_iowait_time(int cpu)
> 
>  #endif
> 
> +static inline unsigned long get_cputime_user(int cpu)
> +{
> +	return kcpustat_cpu(cpu).cpustat[CPUTIME_USER] +
> +			tick_stopped_busy_ticks(cpu);
> +}
> +
>  static int show_stat(struct seq_file *p, void *v)
>  {
>  	int i, j;
> @@ -93,7 +99,7 @@ static int show_stat(struct seq_file *p,
>  	getboottime64(&boottime);
> 
>  	for_each_possible_cpu(i) {
> -		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
> +		user += get_cputime_user(i);
>  		nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
>  		system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>  		idle += get_idle_time(i);
> @@ -130,7 +136,7 @@ static int show_stat(struct seq_file *p,
> 
>  	for_each_online_cpu(i) {
>  		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> -		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
> +		user = get_cputime_user(i);
>  		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
>  		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>  		idle = get_idle_time(i);
> Index: linux/kernel/time/tick-sched.c
> ===================================================================
> --- linux.orig/kernel/time/tick-sched.c	2016-07-27 08:41:17.109862517 -0500
> +++ linux/kernel/time/tick-sched.c	2016-08-17 14:16:42.073835333 -0500
> @@ -990,6 +990,24 @@ ktime_t tick_nohz_get_sleep_length(void)
>  	return ts->sleep_length;
>  }
> 
> +/**
> + * tick_stopped_busy_ticks - return the ticks that did not occur while the
> + *				processor was busy and the tick was off
> + *
> + * Called from sysfs to correctly calculate cputime of nohz full processors
> + */
> +unsigned long tick_stopped_busy_ticks(int cpu)
> +{
> +#ifdef CONFIG_NOHZ_FULL
> +	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> +
> +	if (!ts->inidle && ts->tick_stopped)
> +		return jiffies - ts->idle_jiffies;


It won't work; ts->idle_jiffies only accounts for idle time.

That said, the tick is supposed to fire once per second, so the reason for the freeze is
still unknown. Now, in order to get rid of the residual 1 Hz tick, we'll need to force updates on
cpustats, as that patch intended to.

But I see only two sane ways to do so:

_ fetch the task of CPU X and deduce from the vtime values where it is executing and
  how much delta is to be added to cpustat. The problem here is that we may need to do that
  under the rq lock to make sure the task is really on CPU X and stays there. Perhaps we could
  cheat, though, and add the CPU number to the vtime fields; then vtime_seqcount would be enough
  to get stable results.

_ have housekeeping update all those CPUs' cpustat periodically. But that means we need to
  turn vtime_seqcount back into a seqlock, and that would be a shame for nohz_full performance.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 18:43               ` Andy Lutomirski
  2016-08-30 19:37                 ` Chris Metcalf
@ 2016-09-30 16:59                 ` Chris Metcalf
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-09-30 16:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc).  But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur.  I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace.  Then, on the next entry, clear
> that flag.  With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.

Thinking about this some more, I was struck by an even simpler way
to approach this.  What if we just said that on task isolation cores, no
kernel subsystem should do something that would require a future
interruption?  So vmstat would just always sync immediately on task
isolation cores, the mm subsystem wouldn't use per-cpu LRU stuff on
task isolation cores, etc.  That way we don't have to worry about the
status of those things as we are returning to userspace for a task
isolation process, since it's just always kept "pristine".

The task-isolation setting per-core is not user-customizable, and the
task-stealing scheduler doesn't even run there, so it's not like any
processes will land there and be in a position to complain about the
performance overhead of having no deferred work being created...

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: Ping: [PATCH v15 00/13] support "task_isolation" mode
  2016-09-27 14:35   ` Frederic Weisbecker
@ 2016-09-30 17:07     ` Chris Metcalf
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-09-30 17:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-doc, linux-api, linux-kernel

On 9/27/2016 10:35 AM, Frederic Weisbecker wrote:
> On 8/16/2016 5:19 PM, Chris Metcalf wrote:
>>> Here is a respin of the task-isolation patch set.
>>>
>>> Again, I have been getting email asking me when and where this patch
>>> will be upstreamed so folks can start using it.  I had been thinking
>>> the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
>>> thing.  But perhaps it touches enough other subsystems that that
>>> doesn't really make sense?  Andrew, would it make sense to take it
>>> directly via your tree?  Frederic, Ingo, what do you think?
> As it seems we are still debating a lot of things in this patchset, which has already
> reached v15, I think you should split it into smaller steps in order to move forward,
> and only get to the next step once the previous one is merged.
>
> You could start with a first batch that introduces the prctl() and does the best-effort
> one-shot isolation part, i.e. the actions that only need to be performed once, at
> prctl() time.

So combining this with my reply a moment ago to Andy about just
disabling all deferrable work creation on task isolation cores, that
means we just need a way of checking that the dyntick is off on return
from the prctl.

We could do this in the prctl() itself, but that feels a bit fragile: we could
check that the dyntick is off and try to return success, but then some kind of
interrupt and/or schedule event might happen, and by the time we actually got
back to userspace the dyntick might be running again.

I think what we can do is arrange to set a bit in the process state
that says we are returning from prctl, and then right as we are
returning to userspace with interrupts disabled, we can check if
that bit is set, and if so check at that point to see if the dyntick
is enabled, and if it is, force the syscall return value to EAGAIN
(and clear the bit regardless).

Within the prctl() code itself, we check for hard prerequisites like being on
a task-isolation cpu, and fail -EINVAL if not.

The upshot is that we end up spinning on a loop through userspace where
we keep retrying the prctl() until the timer quiesces.

> Once we get that merged we can focus on what needs to be performed on every return
> to userspace if that's really needed. Including possibly waiting on some completion.

So in NOSIG mode, instead of setting EAGAIN in the return to
userspace path, we arrange to just wait.  We can figure out in a
follow-on patch whether we want to wait by spinning in some way
or by actually waiting on a completion.  For now I'll just include the
remainder of the patch (with spinning) as an RFC just so people
can have the next piece to look ahead to, but I like your idea of
breaking it out of the main patch series entirely.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* task isolation discussion at Linux Plumbers
  2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
                   ` (14 preceding siblings ...)
  2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
@ 2016-11-05  4:04 ` Chris Metcalf
  2016-11-05 16:05   ` Christoph Lameter
                     ` (4 more replies)
  15 siblings, 5 replies; 80+ messages in thread
From: Chris Metcalf @ 2016-11-05  4:04 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Francis Giraldeau, Andi Kleen, Arnd Bergmann,
	linux-kernel

A bunch of people got together this week at the Linux Plumbers
Conference to discuss nohz_full, task isolation, and related stuff.
(Thanks to Thomas for getting everyone gathered at one place and time!)

Here are the notes I took; I welcome any corrections and follow-up.


== rcu_nocbs ==

We started out by discussing this option.  It is automatically enabled
by nohz_full, but we spent a little while side-tracking on the
implementation of one kthread per rcu flavor per core.  The suggestion
was made (by Peter or Andy; I forget) that each kthread could handle
all flavors per core by using a dedicated worklist.  It certainly
seems like removing potentially dozens or hundreds of kthreads from
larger systems will be a win if this works out.

Paul said he would look into this possibility.


== Remote statistics ==

We discussed the possibility of remote statistics gathering, i.e. load
average etc.  The idea would be that we could have housekeeping
core(s) periodically iterate over the nohz cores to load their rq
remotely and do update_current etc.  Presumably it should be possible
for a single housekeeping core to handle doing this for all the
nohz_full cores, as we only need to do it quite infrequently.

Thomas suggested that this might be the last remaining thing that
needed to be done to allow disabling the current behavior of falling
back to a 1 Hz clock in nohz_full.

I believe Thomas said he had a patch to do this already.


== Remote LRU cache drain ==

One of the issues with task isolation currently is that the LRU cache
drain must be done prior to entering userspace, but it requires
interrupts enabled and thus can't be done atomically.  My previous
patch series have handled this by checking with interrupts disabled,
but then looping around with interrupts enabled to try to drain the
LRU pagevecs.  Experimentally this works, but it's not provable that
it terminates, which is worrisome.  Andy suggested adding a percpu
flag to disable creation of deferred work like LRU cache pages.

Thomas suggested using an RT "local lock" to guard the LRU cache
flush; he is planning on bringing the concept to mainline in any case.
However, after some discussion we converged on simply using a spinlock
to guard the appropriate resources.  As a result, the
lru_add_drain_all() code that currently queues work on each remote cpu
to drain it, can instead simply acquire the lock and drain it remotely.
This means that a task isolation task no longer needs to worry about
being interrupted by SMP function call IPIs, so we don't have to deal
with this in the task isolation framework any more.

I don't recall anyone else volunteering to tackle this, so I will plan
to look at it.  The patch to do that should be orthogonal to the
revised task isolation patch series.


== Quiescing vmstat ==

Another issue that task isolation handles is ensuring that the vmstat
worker is quiesced before returning to user space.  Currently we
cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
Currently neither of these things appears safe to do in the
interrupts-disabled context just before return to userspace, because
they both can call schedule(): refresh_cpu_vm_stats() via a
cond_resched() under CONFIG_NUMA, and cancel_delayed_work_sync() via a
schedule() in __cancel_work_timer().

Christoph offered to work with me to make sure that we could do the
appropriate quiescing with interrupts disabled, and seemed confident
it should be doable.


== Remote kernel TLB flush ==

Andy then brought up the issue of remote kernel TLB flush, which I've
been trying to sweep under the rug for the initial task isolation
series.  Remote TLB flush causes an interrupt on many systems (x86 and
tile, for example, although not arm64), so to the extent that it
occurs frequently, it becomes important to handle for task isolation.
With the recent addition of vmap kernel stacks, this becomes suddenly
much more important than it used to be, to the point where we now
really have to handle it for task isolation.

The basic insight here is that you can safely skip interrupting
userspace cores when you are sending remote kernel TLB flushes, since
by definition they can't touch the kernel pages in question anyway.
Then you just need to guarantee to flush the kernel TLB space next
time the userspace task re-enters the kernel.

The original Tilera dataplane code handled this by tracking task state
(kernel, user, or user-flushed) and manipulating the state atomically
at TLB flush time and kernel entry time.  After some discussion of the
overheads of such atomics, Andy pointed out that there is already an
atomic increment being done in the RCU code, and we should be able to
leverage that word to achieve this effect.  The idea is that remote
cores would do a compare-exchange of 0 to 1, which if it succeeded
would indicate that the remote core was in userspace and thus didn't
need to be IPI'd, but that it was now tagged for a kernel flush next
time the remote task entered the kernel.  Then, when the remote task
enters the kernel, it does an atomic update of its own dynticks and
discovers the low bit set, it does a kernel TLB flush before
continuing.

It was agreed that this makes sense to do unconditionally, since it's
not just helpful for nohz_full and task isolation, but also for idle,
since interrupting an idle core periodically just to do repeated
kernel tlb flushes isn't good for power consumption.

One open question is whether we discover the low bit set early enough
in kernel entry that we can trust that we haven't tried to touch any
pages that have been invalidated in the TLB.

Paul agreed to take a look at implementing this.


== Optimizing vfree via RCU ==

An orthogonal issue was also brought up, which is whether we could use
RCU to handle the kernel TLB flush from freeing vmaps; presumably if
we have enough vmap space, we can arrange to return the freed VA space
via RCU, and simply defer the TLB flush until the next grace period.

I'm not sure if this is practical if we encounter a high volume of
vfrees, but I don't think we really reached a definitive agreement on
it during the discussion either.


== Disabling the dyn tick ==

One issue that the current task isolation patch series encounters is
when we request disabling the dyntick, but it doesn't happen.  At the
moment we just wait until the tick is properly disabled, by
busy-waiting in the kernel (calling schedule etc as needed).  No one
is particularly fond of this scheme.  The consensus seems to be to try
harder to figure out what is going on, fix whatever problems exist,
then consider it a regression going forward if something causes the
dyntick to become difficult to disable again in the future.  I will
take a look at this and try to gather more data on if and when this is
happening in 4.9.


== Missing oneshot_stopped callbacks ==

I raised the issue that various clock_event_device sources don't
always support oneshot_stopped, which can cause an additional
final interrupt to occur after the timer infrastructure believes the
interrupt has been stopped.  I have patches to fix this for tile and
arm64 in my patch series; Thomas volunteered to look at adding
equivalent support for x86.


Many thanks to all those who participated in the discussion.
Frederic, we wished you had been there!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: task isolation discussion at Linux Plumbers
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
@ 2016-11-05 16:05   ` Christoph Lameter
  2016-11-07 16:55   ` Thomas Gleixner
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 80+ messages in thread
From: Christoph Lameter @ 2016-11-05 16:05 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, Daniel Lezcano, Francis Giraldeau,
	Andi Kleen, Arnd Bergmann, linux-kernel

On Sat, 5 Nov 2016, Chris Metcalf wrote:

> Here are the notes I took; I welcome any corrections and follow-up.

Thank you for writing this up. I hope we can now move forward on
these issues.

> == Remote statistics ==
>
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc.  The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc.  Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
>
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
>
> I believe Thomas said he had a patch to do this already.


Note that the vmstat_shepherd already scans idle cpus every 2 seconds to
see if there are new updates to vm statistics that require the
reactivation of the vmstat updater. It would be possible to
opportunistically update the statistics should the remote cpu be in
user space (if one figures out the synchronization issues for remote
per-cpu updates).

> == Quiescing vmstat ==
>
> Another issue that task isolation handles is ensuring that the vmstat
> worker is quiesced before returning to user space.  Currently we
> cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
> Currently neither of these things appears safe to do in the
> interrupts-disabled context just before return to userspace, because
> they both can call schedule(): refresh_cpu_vm_stats() via a
> cond_resched() under CONFIG_NUMA, and cancel_delayed_work_sync() via a
> schedule() in __cancel_work_timer().
>
> Christoph offered to work with me to make sure that we could do the
> appropriate quiescing with interrupts disabled, and seemed confident
> it should be doable.

This is already implemented. You can call

	refresh_cpu_vm_stats(false)

to do the quiescing with interrupts disabled.
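For reference, the shape of that quiescing operation can be modeled in
plain C. This is a toy version of folding per-cpu counter diffs into the
global counters, with every name and data structure invented for
illustration (the real code lives in mm/vmstat.c):

```c
#include <stdbool.h>

#define NR_CPUS   4
#define NR_ITEMS  2   /* toy stand-in for the vm statistics items */

static long vm_stat[NR_ITEMS];            /* global counters */
static long vm_diff[NR_CPUS][NR_ITEMS];   /* per-cpu pending diffs */

/* Like vmstat_idle(): true if this cpu has nothing left to fold. */
static bool vmstat_idle(int cpu)
{
	for (int i = 0; i < NR_ITEMS; i++)
		if (vm_diff[cpu][i])
			return false;
	return true;
}

/* Like refresh_cpu_vm_stats(false): fold this cpu's diffs into the
 * global counters without doing anything that could sleep, so the
 * operation is usable with interrupts disabled on the return-to-user
 * path. */
static void fold_cpu_vm_stats(int cpu)
{
	for (int i = 0; i < NR_ITEMS; i++) {
		vm_stat[i] += vm_diff[cpu][i];
		vm_diff[cpu][i] = 0;
	}
}
```

A task-isolation exit path would then check vmstat_idle() first and only
fold when needed, matching the v14 optimization mentioned earlier in the
thread.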

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: task isolation discussion at Linux Plumbers
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
  2016-11-05 16:05   ` Christoph Lameter
@ 2016-11-07 16:55   ` Thomas Gleixner
  2016-11-07 18:36     ` Thomas Gleixner
  2016-11-11 20:54     ` Luiz Capitulino
  2016-11-09  1:40   ` Paul E. McKenney
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 80+ messages in thread
From: Thomas Gleixner @ 2016-11-07 16:55 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Sat, 5 Nov 2016, Chris Metcalf wrote:
> == Remote statistics ==
> 
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc.  The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc.  Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
> 
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
> 
> I believe Thomas said he had a patch to do this already.

No, Rik was working on that.

> == Remote LRU cache drain ==
> 
> One of the issues with task isolation currently is that the LRU cache
> drain must be done prior to entering userspace, but it requires
> interrupts enabled and thus can't be done atomically.  My previous
> patch series have handled this by checking with interrupts disabled,
> but then looping around with interrupts enabled to try to drain the
> LRU pagevecs.  Experimentally this works, but it's not provable that
> it terminates, which is worrisome.  Andy suggested adding a percpu
> flag to disable creation of deferred work like LRU cache pages.
> 
> Thomas suggested using an RT "local lock" to guard the LRU cache
> flush; he is planning on bringing the concept to mainline in any case.
> However, after some discussion we converged on simply using a spinlock
> to guard the appropriate resources.  As a result, the
> lru_add_drain_all() code that currently queues work on each remote cpu
> to drain it, can instead simply acquire the lock and drain it remotely.
> This means that a task isolation task no longer needs to worry about
> being interrupted by SMP function call IPIs, so we don't have to deal
> with this in the task isolation framework any more.
> 
> I don't recall anyone else volunteering to tackle this, so I will plan
> to look at it.  The patch to do that should be orthogonal to the
> revised task isolation patch series.

I offered to clean up the patch from RT. I'll do that in the next few days.
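As a rough userspace sketch of that direction, the idea is per-cpu
pagevecs guarded by a spinlock, so a housekeeping CPU can drain them
remotely instead of queueing work (and thus an IPI) on each target. All
names and the pagevec representation here are made up for illustration:

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* Minimal test-and-set spinlock standing in for the per-cpu lock. */
typedef atomic_flag spinlock_t;
static void spin_lock(spinlock_t *l)   { while (atomic_flag_test_and_set(l)) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(l); }

struct cpu_pagevec {
	spinlock_t lock;
	int nr_pages;            /* pages batched, not yet on the LRU */
};

static struct cpu_pagevec pvecs[NR_CPUS];
static int lru_pages;            /* pages actually on the LRU lists */

static void pvecs_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_flag_clear(&pvecs[cpu].lock);
}

/* Local path: a CPU batches a page under its own lock. */
static void lru_cache_add(int cpu)
{
	spin_lock(&pvecs[cpu].lock);
	pvecs[cpu].nr_pages++;
	spin_unlock(&pvecs[cpu].lock);
}

/* Housekeeping path: drain every CPU remotely under its lock, so no
 * SMP function call IPI ever lands on an isolated task. */
static void lru_add_drain_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		spin_lock(&pvecs[cpu].lock);
		lru_pages += pvecs[cpu].nr_pages;
		pvecs[cpu].nr_pages = 0;
		spin_unlock(&pvecs[cpu].lock);
	}
}
```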
 
> == Missing oneshot_stopped callbacks ==
> 
> I raised the issue that various clock_event_device sources don't
> always support oneshot_stopped, which can cause an additional
> final interrupt to occur after the timer infrastructure believes the
> interrupt has been stopped.  I have patches to fix this for tile and
> arm64 in my patch series; Thomas volunteered to look at adding
> equivalent support for x86.

Right.

Thanks,

	tglx


* Re: task isolation discussion at Linux Plumbers
  2016-11-07 16:55   ` Thomas Gleixner
@ 2016-11-07 18:36     ` Thomas Gleixner
  2016-11-07 19:12       ` Rik van Riel
  2016-11-11 20:54     ` Luiz Capitulino
  1 sibling, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2016-11-07 18:36 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > == Missing oneshot_stopped callbacks ==
> > 
> > I raised the issue that various clock_event_device sources don't
> > always support oneshot_stopped, which can cause an additional
> > final interrupt to occur after the timer infrastructure believes the
> > interrupt has been stopped.  I have patches to fix this for tile and
> > arm64 in my patch series; Thomas volunteered to look at adding
> > equivalent support for x86.
> 
> Right.

Untested patch below should fix that.

Thanks,

	tglx

8<---------------
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -530,18 +530,20 @@ static void lapic_timer_broadcast(const
  * The local apic timer can be used for any function which is CPU local.
  */
 static struct clock_event_device lapic_clockevent = {
-	.name			= "lapic",
-	.features		= CLOCK_EVT_FEAT_PERIODIC |
-				  CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP
-				  | CLOCK_EVT_FEAT_DUMMY,
-	.shift			= 32,
-	.set_state_shutdown	= lapic_timer_shutdown,
-	.set_state_periodic	= lapic_timer_set_periodic,
-	.set_state_oneshot	= lapic_timer_set_oneshot,
-	.set_next_event		= lapic_next_event,
-	.broadcast		= lapic_timer_broadcast,
-	.rating			= 100,
-	.irq			= -1,
+	.name				= "lapic",
+	.features			= CLOCK_EVT_FEAT_PERIODIC |
+					  CLOCK_EVT_FEAT_ONESHOT |
+					  CLOCK_EVT_FEAT_C3STOP |
+					  CLOCK_EVT_FEAT_DUMMY,
+	.shift				= 32,
+	.set_state_shutdown		= lapic_timer_shutdown,
+	.set_state_periodic		= lapic_timer_set_periodic,
+	.set_state_oneshot		= lapic_timer_set_oneshot,
+	.set_state_oneshot_stopped	= lapic_timer_shutdown,
+	.set_next_event			= lapic_next_event,
+	.broadcast			= lapic_timer_broadcast,
+	.rating				= 100,
+	.irq				= -1,
 };
 static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
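To see what the single functional line in that patch buys, here is a toy
model (invented names, not the real clockevents API) of the timer core
trying to stop a oneshot device whose last event was cancelled:

```c
#include <stddef.h>
#include <stdbool.h>

/* Without a set_state_oneshot_stopped callback, the core has no way to
 * tell the hardware to stand down, so an already-programmed event still
 * fires one final time. */
struct toy_clockevent {
	bool armed;   /* an event is programmed in the "hardware" */
	void (*set_state_oneshot_stopped)(struct toy_clockevent *);
};

/* Reusing the shutdown handler, as the lapic patch does. */
static void toy_shutdown(struct toy_clockevent *ce)
{
	ce->armed = false;
}

/* Timer core: cancel the pending event if the device lets us. */
static void toy_stop(struct toy_clockevent *ce)
{
	if (ce->set_state_oneshot_stopped != NULL)
		ce->set_state_oneshot_stopped(ce);
	/* else: nothing to call; a spurious final interrupt may arrive */
}
```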
 


* Re: task isolation discussion at Linux Plumbers
  2016-11-07 18:36     ` Thomas Gleixner
@ 2016-11-07 19:12       ` Rik van Riel
  2016-11-07 19:16         ` Will Deacon
  0 siblings, 1 reply; 80+ messages in thread
From: Rik van Riel @ 2016-11-07 19:12 UTC (permalink / raw)
  To: Thomas Gleixner, Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Tejun Heo, Frederic Weisbecker, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, Andi Kleen,
	Arnd Bergmann, linux-kernel


On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > 
> > > 
> > > == Missing oneshot_stopped callbacks ==
> > > 
> > > I raised the issue that various clock_event_device sources don't
> > > always support oneshot_stopped, which can cause an additional
> > > final interrupt to occur after the timer infrastructure believes
> > > the
> > > interrupt has been stopped.  I have patches to fix this for tile
> > > and
> > > arm64 in my patch series; Thomas volunteered to look at adding
> > > equivalent support for x86.
> > 
> > Right.
> 
> Untested patch below should fix that.
> 

That whitespace cleanup looks awesome, but I am not
optimistic about its chances to bring about functional
change.

What am I overlooking?

-- 
All Rights Reversed.



* Re: task isolation discussion at Linux Plumbers
  2016-11-07 19:12       ` Rik van Riel
@ 2016-11-07 19:16         ` Will Deacon
  2016-11-07 19:18           ` Rik van Riel
  0 siblings, 1 reply; 80+ messages in thread
From: Will Deacon @ 2016-11-07 19:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Thomas Gleixner, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Mon, Nov 07, 2016 at 02:12:13PM -0500, Rik van Riel wrote:
> On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> > On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > > 
> > > > 
> > > > == Missing oneshot_stopped callbacks ==
> > > > 
> > > > I raised the issue that various clock_event_device sources don't
> > > > always support oneshot_stopped, which can cause an additional
> > > > final interrupt to occur after the timer infrastructure believes
> > > > the
> > > > interrupt has been stopped.  I have patches to fix this for tile
> > > > and
> > > > arm64 in my patch series; Thomas volunteered to look at adding
> > > > equivalent support for x86.
> > > 
> > > Right.
> > 
> > Untested patch below should fix that.
> > 
> 
> That whitespace cleanup looks awesome, but I am not
> optimistic about its chances to bring about functional
> change.
> 
> What am I overlooking?

It hooks up .set_state_oneshot_stopped?

Will


* Re: task isolation discussion at Linux Plumbers
  2016-11-07 19:16         ` Will Deacon
@ 2016-11-07 19:18           ` Rik van Riel
  0 siblings, 0 replies; 80+ messages in thread
From: Rik van Riel @ 2016-11-07 19:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel


On Mon, 2016-11-07 at 19:16 +0000, Will Deacon wrote:
> On Mon, Nov 07, 2016 at 02:12:13PM -0500, Rik van Riel wrote:
> > 
> > On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> > > 
> > > On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > == Missing oneshot_stopped callbacks ==
> > > > > 
> > > > > I raised the issue that various clock_event_device sources
> > > > > don't
> > > > > always support oneshot_stopped, which can cause an additional
> > > > > final interrupt to occur after the timer infrastructure
> > > > > believes
> > > > > the
> > > > > interrupt has been stopped.  I have patches to fix this for
> > > > > tile
> > > > > and
> > > > > arm64 in my patch series; Thomas volunteered to look at
> > > > > adding
> > > > > equivalent support for x86.
> > > > 
> > > > Right.
> > > 
> > > Untested patch below should fix that.
> > >  
> > 
> > That whitespace cleanup looks awesome, but I am not
> > optimistic about its chances to bring about functional
> > change.
> > 
> > What am I overlooking?
> 
> It hooks up .set_state_oneshot_stopped?

Gah, indeed. Never mind :)

-- 
All Rights Reversed.



* Re: task isolation discussion at Linux Plumbers
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
  2016-11-05 16:05   ` Christoph Lameter
  2016-11-07 16:55   ` Thomas Gleixner
@ 2016-11-09  1:40   ` Paul E. McKenney
  2016-11-09 11:14     ` Andy Lutomirski
  2016-11-09 11:07   ` Frederic Weisbecker
  2016-12-19 14:37   ` Paul E. McKenney
  4 siblings, 1 reply; 80+ messages in thread
From: Paul E. McKenney @ 2016-11-09  1:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Sat, Nov 05, 2016 at 12:04:45AM -0400, Chris Metcalf wrote:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)
> 
> Here are the notes I took; I welcome any corrections and follow-up.

[ . . . ]

> == Remote kernel TLB flush ==
> 
> Andy then brought up the issue of remote kernel TLB flush, which I've
> been trying to sweep under the rug for the initial task isolation
> series.  Remote TLB flush causes an interrupt on many systems (x86 and
> tile, for example, although not arm64), so to the extent that it
> occurs frequently, it becomes important to handle for task isolation.
> With the recent addition of vmap kernel stacks, this becomes suddenly
> much more important than it used to be, to the point where we now
> really have to handle it for task isolation.
> 
> The basic insight here is that you can safely skip interrupting
> userspace cores when you are sending remote kernel TLB flushes, since
> by definition they can't touch the kernel pages in question anyway.
> Then you just need to guarantee to flush the kernel TLB space next
> time the userspace task re-enters the kernel.
> 
> The original Tilera dataplane code handled this by tracking task state
> (kernel, user, or user-flushed) and manipulating the state atomically
> at TLB flush time and kernel entry time.  After some discussion of the
> overheads of such atomics, Andy pointed out that there is already an
> atomic increment being done in the RCU code, and we should be able to
> leverage that word to achieve this effect.  The idea is that remote
> cores would do a compare-exchange of 0 to 1, which if it succeeded
> would indicate that the remote core was in userspace and thus didn't
> need to be IPI'd, but that it was now tagged for a kernel flush next
> time the remote task entered the kernel.  Then, when the remote task
> enters the kernel, it does an atomic update of its own dynticks and
> discovers the low bit set, it does a kernel TLB flush before
> continuing.
> 
> It was agreed that this makes sense to do unconditionally, since it's
> not just helpful for nohz_full and task isolation, but also for idle,
> since interrupting an idle core periodically just to do repeated
> kernel tlb flushes isn't good for power consumption.
> 
> One open question is whether we discover the low bit set early enough
> in kernel entry that we can trust that we haven't tried to touch any
> pages that have been invalidated in the TLB.
> 
> Paul agreed to take a look at implementing this.

Please see a prototype at 49961e272333 at:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Or see below for the patch.

As discussed earlier, once I get this working, I will hand it off to you
to add your code.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 49961e272333ac720ac4ccbaba45521bfea259ae
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Nov 8 14:25:21 2016 -0800

    rcu: Maintain special bits at bottom of ->dynticks counter
    
    Currently, IPIs are used to force other CPUs to invalidate their TLBs
    in response to a kernel virtual-memory mapping change.  This works, but
    degrades both battery lifetime (for idle CPUs) and real-time response
    (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
    the fact that CPUs executing in usermode are unaffected by stale kernel
    mappings.  It would be better to cause a CPU executing in usermode to
    wait until it is entering kernel mode to
    
    This commit therefore reserves a bit at the bottom of the ->dynticks
    counter, which is checked upon exit from extended quiescent states.  If it
    is set, it is cleared and then a new rcu_dynticks_special_exit() macro
    is invoked, which, if not supplied, is an empty single-pass do-while loop.
    If this bottom bit is set on -entry- to an extended quiescent state,
    then a WARN_ON_ONCE() triggers.
    
    This bottom bit may be set using a new rcu_dynticks_special_set()
    function, which returns true if the bit was set, or false if the CPU
    turned out to not be in an extended quiescent state.  Please note that
    this function refuses to set the bit for a non-nohz_full CPU when that
    CPU is executing in usermode because usermode execution is tracked by
    RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.
    
    Reported-by: Andy Lutomirski <luto@amacapital.net>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..130d911e4ba1 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 	return 0;
 }
 
+static inline bool rcu_dynticks_special_set(int cpu)
+{
+	return false;  /* Never flag non-existent other CPUs! */
+}
+
 static inline unsigned long get_state_synchronize_rcu(void)
 {
 	return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..8de83830e86b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 };
 
 /*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control.  Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_dynticks_special_exit
+#define rcu_dynticks_special_exit() do { } while (0)
+#endif
+
+/*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state.
  */
 static void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior RCU read-side critical
-	 * sections, and we also must force ordering with the next idle
-	 * sojourn.
+	 * CPUs seeing atomic_inc_return() must see prior RCU read-side
+	 * critical sections, and we also must force ordering with the
+	 * next idle sojourn.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
+	/* Better be in an extended quiescent state! */
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     (seq & RCU_DYNTICK_CTRL_CTR));
+	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     atomic_read(&rdtp->dynticks) & 0x1);
+		     (seq & RCU_DYNTICK_CTRL_MASK));
 }
 
 /*
@@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
 static void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior idle sojourns,
+	 * CPUs seeing atomic_inc_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     !(atomic_read(&rdtp->dynticks) & 0x1));
+		     !(seq & RCU_DYNTICK_CTRL_CTR));
+	if (seq & RCU_DYNTICK_CTRL_MASK) {
+		atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+		smp_mb__after_atomic(); /* Clear bits before acting on them */
+		rcu_dynticks_special_exit();
+	}
 }
 
 /*
@@ -326,7 +343,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 {
 	int snap = atomic_add_return(0, &rdtp->dynticks);
 
-	return snap;
+	return snap & ~RCU_DYNTICK_CTRL_MASK;
 }
 
 /*
@@ -335,7 +352,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
  */
 static bool rcu_dynticks_in_eqs(int snap)
 {
-	return !(snap & 0x1);
+	return !(snap & RCU_DYNTICK_CTRL_CTR);
 }
 
 /*
@@ -355,10 +372,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
 static void rcu_dynticks_momentary_idle(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
-	int special = atomic_add_return(2, &rdtp->dynticks);
+	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+					&rdtp->dynticks);
 
 	/* It is illegal to call this from idle state. */
-	WARN_ON_ONCE(!(special & 0x1));
+	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state.  Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_dynticks_special_set(int cpu)
+{
+	int old;
+	int new;
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	do {
+		old = atomic_read(&rdtp->dynticks);
+		if (old & RCU_DYNTICK_CTRL_CTR)
+			return false;
+		new = old | ~RCU_DYNTICK_CTRL_MASK;
+	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+	return true;
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..c444787a3bdc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_dynticks_special_set(int cpu);
 
 #ifdef CONFIG_RCU_BOOST
 DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);
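The state machine in the patch above can be exercised as a standalone
userspace model. Everything here is illustrative, not the kernel code
itself: C11 atomics for a single "CPU", with the flag-setting expression
written as old | RCU_DYNTICK_CTRL_MASK:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the ->dynticks bit-stealing scheme: the counter moves in
 * steps of RCU_DYNTICK_CTRL_CTR, and the stolen bottom bit flags a
 * deferred action (e.g. a kernel TLB flush) to run on the next exit
 * from an extended quiescent state (EQS). */
#define RCU_DYNTICK_CTRL_MASK 0x1
#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)

static atomic_int dynticks = RCU_DYNTICK_CTRL_CTR; /* CTR bit set: not in EQS */
static int special_exits; /* how many deferred actions have run */

static void model_eqs_enter(void)
{
	int seq = atomic_fetch_add(&dynticks, RCU_DYNTICK_CTRL_CTR)
		  + RCU_DYNTICK_CTRL_CTR;

	assert(!(seq & RCU_DYNTICK_CTRL_CTR));  /* now in EQS */
	assert(!(seq & RCU_DYNTICK_CTRL_MASK)); /* no action may be pending */
}

static void model_eqs_exit(void)
{
	int seq = atomic_fetch_add(&dynticks, RCU_DYNTICK_CTRL_CTR)
		  + RCU_DYNTICK_CTRL_CTR;

	assert(seq & RCU_DYNTICK_CTRL_CTR);     /* now out of EQS */
	if (seq & RCU_DYNTICK_CTRL_MASK) {
		atomic_fetch_and(&dynticks, ~RCU_DYNTICK_CTRL_MASK);
		special_exits++;                /* rcu_dynticks_special_exit() */
	}
}

/* Tag the CPU for a deferred action; fails if it is not in EQS, in
 * which case the caller must fall back to sending an IPI. */
static bool model_special_set(void)
{
	int old, new;

	do {
		old = atomic_load(&dynticks);
		if (old & RCU_DYNTICK_CTRL_CTR)
			return false;               /* busy in the kernel */
		new = old | RCU_DYNTICK_CTRL_MASK;  /* set just the flag */
	} while (!atomic_compare_exchange_weak(&dynticks, &old, new));
	return true;
}
```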


* Re: task isolation discussion at Linux Plumbers
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
                     ` (2 preceding siblings ...)
  2016-11-09  1:40   ` Paul E. McKenney
@ 2016-11-09 11:07   ` Frederic Weisbecker
  2016-12-19 14:37   ` Paul E. McKenney
  4 siblings, 0 replies; 80+ messages in thread
From: Frederic Weisbecker @ 2016-11-09 11:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, LKML

2016-11-05 4:04 GMT+00:00 Chris Metcalf <cmetcalf@mellanox.com>:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)
>
> Here are the notes I took; I welcome any corrections and follow-up.
>

Thanks for that report Chris!

> == rcu_nocbs ==
>
> We started out by discussing this option.  It is automatically enabled
> by nohz_full, but we spent a little while side-tracking on the
> implementation of one kthread per rcu flavor per core.  The suggestion
> was made (by Peter or Andy; I forget) that each kthread could handle
> all flavors per core by using a dedicated worklist.  It certainly
> seems like removing potentially dozens or hundreds of kthreads from
> larger systems will be a win if this works out.
>
> Paul said he would look into this possibility.

Sounds good.

>
>
> == Remote statistics ==
>
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc.  The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc.  Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
>
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
>
> I believe Thomas said he had a patch to do this already.
>

There are also some other details around update_curr to take care of,
but that's certainly a big piece of it.
I had wished we could find a solution that doesn't involve remote
accounting, but at least it could be a first step.
I have let that idea rot for too long; I need to get my hands into
it for good.

> == Disabling the dyn tick ==
>
> One issue that the current task isolation patch series encounters is
> when we request disabling the dyntick, but it doesn't happen.  At the
> moment we just wait until the tick is properly disabled, by
> busy-waiting in the kernel (calling schedule etc as needed).  No one
> is particularly fond of this scheme.  The consensus seems to be to try
> harder to figure out what is going on, fix whatever problems exist,
> then consider it a regression going forward if something causes the
> dyntick to become difficult to disable again in the future.  I will
> take a look at this and try to gather more data on if and when this is
> happening in 4.9.
>

We could enhance dynticks tracing, for example by expanding the tick
stop failure codes in order to report more details about what's going on.

> == Missing oneshot_stopped callbacks ==
>
> I raised the issue that various clock_event_device sources don't
> always support oneshot_stopped, which can cause an additional
> final interrupt to occur after the timer infrastructure believes the
> interrupt has been stopped.  I have patches to fix this for tile and
> arm64 in my patch series; Thomas volunteered to look at adding
> equivalent support for x86.
>
>
> Many thanks to all those who participated in the discussion.
> Frederic, we wished you had been there!

I wish I had too!


* Re: task isolation discussion at Linux Plumbers
  2016-11-09  1:40   ` Paul E. McKenney
@ 2016-11-09 11:14     ` Andy Lutomirski
  2016-11-09 17:38       ` Paul E. McKenney
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-11-09 11:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Tue, Nov 8, 2016 at 5:40 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:

> commit 49961e272333ac720ac4ccbaba45521bfea259ae
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Tue Nov 8 14:25:21 2016 -0800
>
>     rcu: Maintain special bits at bottom of ->dynticks counter
>
>     Currently, IPIs are used to force other CPUs to invalidate their TLBs
>     in response to a kernel virtual-memory mapping change.  This works, but
>     degrades both battery lifetime (for idle CPUs) and real-time response
>     (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
>     the fact that CPUs executing in usermode are unaffected by stale kernel
>     mappings.  It would be better to cause a CPU executing in usermode to
>     wait until it is entering kernel mode to

missing words here?

>
>     This commit therefore reserves a bit at the bottom of the ->dynticks
>     counter, which is checked upon exit from extended quiescent states.  If it
>     is set, it is cleared and then a new rcu_dynticks_special_exit() macro
>     is invoked, which, if not supplied, is an empty single-pass do-while loop.
>     If this bottom bit is set on -entry- to an extended quiescent state,
>     then a WARN_ON_ONCE() triggers.
>
>     This bottom bit may be set using a new rcu_dynticks_special_set()
>     function, which returns true if the bit was set, or false if the CPU
>     turned out to not be in an extended quiescent state.  Please note that
>     this function refuses to set the bit for a non-nohz_full CPU when that
>     CPU is executing in usermode because usermode execution is tracked by
>     RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.

I'm inclined to suggest s/dynticks/eqs/ in the public API.  To me,
"dynticks" is a feature, whereas "eqs" means "extended quiescent
state" and means something concrete about the CPU state

>
>     Reported-by: Andy Lutomirski <luto@amacapital.net>
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 4f9b2fa2173d..130d911e4ba1 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
>         return 0;
>  }
>
> +static inline bool rcu_dynticks_special_set(int cpu)
> +{
> +       return false;  /* Never flag non-existent other CPUs! */
> +}
> +
>  static inline unsigned long get_state_synchronize_rcu(void)
>  {
>         return 0;
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index dbf20b058f48..8de83830e86b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
>  };
>
>  /*
> + * Steal a bit from the bottom of ->dynticks for idle entry/exit
> + * control.  Initially this is for TLB flushing.
> + */
> +#define RCU_DYNTICK_CTRL_MASK 0x1
> +#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
> +#ifndef rcu_dynticks_special_exit
> +#define rcu_dynticks_special_exit() do { } while (0)
> +#endif
> +

>  /*
> @@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
>  static void rcu_dynticks_eqs_exit(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> +       int seq;
>
>         /*
> -        * CPUs seeing atomic_inc() must see prior idle sojourns,
> +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
>          * and we also must force ordering with the next RCU read-side
>          * critical section.
>          */
> -       smp_mb__before_atomic(); /* See above. */
> -       atomic_inc(&rdtp->dynticks);
> -       smp_mb__after_atomic(); /* See above. */
> +       seq = atomic_inc_return(&rdtp->dynticks);
>         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> -                    !(atomic_read(&rdtp->dynticks) & 0x1));
> +                    !(seq & RCU_DYNTICK_CTRL_CTR));
> +       if (seq & RCU_DYNTICK_CTRL_MASK) {
> +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> +               smp_mb__after_atomic(); /* Clear bits before acting on them */
> +               rcu_dynticks_special_exit();

I think this needs to be reversed for NMI safety: do the callback and
then clear the bits.

> +/*
> + * Set the special (bottom) bit of the specified CPU so that it
> + * will take special action (such as flushing its TLB) on the
> + * next exit from an extended quiescent state.  Returns true if
> + * the bit was successfully set, or false if the CPU was not in
> + * an extended quiescent state.
> + */
> +bool rcu_dynticks_special_set(int cpu)
> +{
> +       int old;
> +       int new;
> +       struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +
> +       do {
> +               old = atomic_read(&rdtp->dynticks);
> +               if (old & RCU_DYNTICK_CTRL_CTR)
> +                       return false;
> +               new = old | ~RCU_DYNTICK_CTRL_MASK;

Shouldn't this be old | RCU_DYNTICK_CTRL_MASK?

> +       } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> +       return true;
>  }
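The arithmetic behind that question is easy to check in isolation: with
a one-bit mask, old | ~RCU_DYNTICK_CTRL_MASK sets every bit except the
flag bit, leaving the flag clear and clobbering the counter, while
old | RCU_DYNTICK_CTRL_MASK sets only the flag. A standalone check, not
kernel code:

```c
#include <stdint.h>

#define RCU_DYNTICK_CTRL_MASK 0x1u

/* 'old' models an even ->dynticks value: in EQS, flag bit clear. */
static uint32_t tag_wrong(uint32_t old)
{
	return old | ~RCU_DYNTICK_CTRL_MASK;  /* sets all bits BUT the flag */
}

static uint32_t tag_right(uint32_t old)
{
	return old | RCU_DYNTICK_CTRL_MASK;   /* sets just the flag */
}
```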

--Andy


* Re: task isolation discussion at Linux Plumbers
  2016-11-09 11:14     ` Andy Lutomirski
@ 2016-11-09 17:38       ` Paul E. McKenney
  2016-11-09 18:57         ` Will Deacon
  2016-11-10  1:44         ` Andy Lutomirski
  0 siblings, 2 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-11-09 17:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 09, 2016 at 03:14:35AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 8, 2016 at 5:40 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:

Thank you for the review and comments!

> > commit 49961e272333ac720ac4ccbaba45521bfea259ae
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Tue Nov 8 14:25:21 2016 -0800
> >
> >     rcu: Maintain special bits at bottom of ->dynticks counter
> >
> >     Currently, IPIs are used to force other CPUs to invalidate their TLBs
> >     in response to a kernel virtual-memory mapping change.  This works, but
> >     degrades both battery lifetime (for idle CPUs) and real-time response
> >     (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
> >     the fact that CPUs executing in usermode are unaffected by stale kernel
> >     mappings.  It would be better to cause a CPU executing in usermode to
> >     wait until it is entering kernel mode to
> 
> missing words here?

Just a few, added more.  ;-)

> >     This commit therefore reserves a bit at the bottom of the ->dynticks
> >     counter, which is checked upon exit from extended quiescent states.  If it
> >     is set, it is cleared and then a new rcu_dynticks_special_exit() macro
> >     is invoked, which, if not supplied, is an empty single-pass do-while loop.
> >     If this bottom bit is set on -entry- to an extended quiescent state,
> >     then a WARN_ON_ONCE() triggers.
> >
> >     This bottom bit may be set using a new rcu_dynticks_special_set()
> >     function, which returns true if the bit was set, or false if the CPU
> >     turned out to not be in an extended quiescent state.  Please note that
> >     this function refuses to set the bit for a non-nohz_full CPU when that
> >     CPU is executing in usermode because usermode execution is tracked by
> >     RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.
> 
> I'm inclined to suggest s/dynticks/eqs/ in the public API.  To me,
> "dynticks" is a feature, whereas "eqs" means "extended quiescent
> state" and means something concrete about the CPU state

OK, I have changed rcu_dynticks_special_exit() to rcu_eqs_special_exit().
I also changed rcu_dynticks_special_set() to rcu_eqs_special_set().

I left rcu_dynticks_snap() as is because it is internal to RCU.  (External
only for the benefit of kernel/rcu/tree_trace.c.)

Any others?

Current state of patch below.

> >     Reported-by: Andy Lutomirski <luto@amacapital.net>
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 4f9b2fa2173d..130d911e4ba1 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
> >         return 0;
> >  }
> >
> > +static inline bool rcu_dynticks_special_set(int cpu)
> > +{
> > +       return false;  /* Never flag non-existent other CPUs! */
> > +}
> > +
> >  static inline unsigned long get_state_synchronize_rcu(void)
> >  {
> >         return 0;
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index dbf20b058f48..8de83830e86b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> >  };
> >
> >  /*
> > + * Steal a bit from the bottom of ->dynticks for idle entry/exit
> > + * control.  Initially this is for TLB flushing.
> > + */
> > +#define RCU_DYNTICK_CTRL_MASK 0x1
> > +#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
> > +#ifndef rcu_dynticks_special_exit
> > +#define rcu_dynticks_special_exit() do { } while (0)
> > +#endif
> > +
> 
> >  /*
> > @@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
> >  static void rcu_dynticks_eqs_exit(void)
> >  {
> >         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > +       int seq;
> >
> >         /*
> > -        * CPUs seeing atomic_inc() must see prior idle sojourns,
> > +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> >          * and we also must force ordering with the next RCU read-side
> >          * critical section.
> >          */
> > -       smp_mb__before_atomic(); /* See above. */
> > -       atomic_inc(&rdtp->dynticks);
> > -       smp_mb__after_atomic(); /* See above. */
> > +       seq = atomic_inc_return(&rdtp->dynticks);
> >         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > -                    !(atomic_read(&rdtp->dynticks) & 0x1));
> > +                    !(seq & RCU_DYNTICK_CTRL_CTR));
> > +       if (seq & RCU_DYNTICK_CTRL_MASK) {
> > +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > +               smp_mb__after_atomic(); /* Clear bits before acting on them */
> > +               rcu_dynticks_special_exit();
> 
> I think this needs to be reversed for NMI safety: do the callback and
> then clear the bits.

OK.  Ah, the race that I was worried about can't happen due to the
fact that rdtp->dynticks gets incremented before the call to
rcu_dynticks_special_exit().

Good catch, fixed.

And the other thing I forgot is that I cannot clear the bottom bits if
this is an NMI handler.  But now I cannot construct a case where this
is a problem.  The only way this could matter is if an NMI is taken in
an extended quiescent state.  In that case, the code flushes and clears
the bit, and any later remote-flush request to this CPU will set the
bit again.  And any races between the NMI handler and the other CPU look
the same as between IRQ handlers and process entry.

Yes, this one needs some formal verification, doesn't it?

In the meantime, if you can reproduce the race that led us to believe
that NMI handlers should not clear the bottom bits, please let me know.

> > +/*
> > + * Set the special (bottom) bit of the specified CPU so that it
> > + * will take special action (such as flushing its TLB) on the
> > + * next exit from an extended quiescent state.  Returns true if
> > + * the bit was successfully set, or false if the CPU was not in
> > + * an extended quiescent state.
> > + */
> > +bool rcu_dynticks_special_set(int cpu)
> > +{
> > +       int old;
> > +       int new;
> > +       struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +
> > +       do {
> > +               old = atomic_read(&rdtp->dynticks);
> > +               if (old & RCU_DYNTICK_CTRL_CTR)
> > +                       return false;
> > +               new = old | ~RCU_DYNTICK_CTRL_MASK;
> 
> Shouldn't this be old | RCU_DYNTICK_CTRL_MASK?

Indeed it should!  (What -was- I thinking?)  Fixed.

> > +       } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> > +       return true;
> >  }

Thank you again, please see update below.

							Thanx, Paul

------------------------------------------------------------------------

commit 7bbb80d5f612e7f0ffc826b11d292a3616150b34
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Nov 8 14:25:21 2016 -0800

    rcu: Maintain special bits at bottom of ->dynticks counter
    
    Currently, IPIs are used to force other CPUs to invalidate their TLBs
    in response to a kernel virtual-memory mapping change.  This works, but
    degrades both battery lifetime (for idle CPUs) and real-time response
    (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
    the fact that CPUs executing in usermode are unaffected by stale kernel
    mappings.  It would be better to cause a CPU executing in usermode to
    wait until it is entering kernel mode to do the flush, first to avoid
    interrupting usermode tasks and second to handle multiple flush requests
    with a single flush in the case of a long-running user task.
    
    This commit therefore reserves a bit at the bottom of the ->dynticks
    counter, which is checked upon exit from extended quiescent states.
    If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
    invoked, which, if not supplied, is an empty single-pass do-while loop.
    If this bottom bit is set on -entry- to an extended quiescent state,
    then a WARN_ON_ONCE() triggers.
    
    This bottom bit may be set using a new rcu_eqs_special_set() function,
    which returns true if the bit was set, or false if the CPU turned
    out to not be in an extended quiescent state.  Please note that this
    function refuses to set the bit for a non-nohz_full CPU when that CPU
    is executing in usermode because usermode execution is tracked by RCU
    as a dyntick-idle extended quiescent state only for nohz_full CPUs.
    
    Reported-by: Andy Lutomirski <luto@amacapital.net>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..7232d199a81c 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 	return 0;
 }
 
+static inline bool rcu_eqs_special_set(int cpu)
+{
+	return false;  /* Never flag non-existent other CPUs! */
+}
+
 static inline unsigned long get_state_synchronize_rcu(void)
 {
 	return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..342c8ee402d6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 };
 
 /*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control.  Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_eqs_special_exit
+#define rcu_eqs_special_exit() do { } while (0)
+#endif
+
+/*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state.
  */
 static void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior RCU read-side critical
-	 * sections, and we also must force ordering with the next idle
-	 * sojourn.
+	 * CPUs seeing atomic_inc_return() must see prior RCU read-side
+	 * critical sections, and we also must force ordering with the
+	 * next idle sojourn.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
+	/* Better be in an extended quiescent state! */
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     (seq & RCU_DYNTICK_CTRL_CTR));
+	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     atomic_read(&rdtp->dynticks) & 0x1);
+		     (seq & RCU_DYNTICK_CTRL_MASK));
 }
 
 /*
@@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
 static void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior idle sojourns,
+	 * CPUs seeing atomic_inc_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_inc_return(&rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     !(atomic_read(&rdtp->dynticks) & 0x1));
+		     !(seq & RCU_DYNTICK_CTRL_CTR));
+	if (seq & RCU_DYNTICK_CTRL_MASK) {
+		rcu_eqs_special_exit();
+		/* Prefer duplicate flushes to losing a flush. */
+		smp_mb__before_atomic(); /* NMI safety. */
+		atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+	}
 }
 
 /*
@@ -326,7 +344,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 {
 	int snap = atomic_add_return(0, &rdtp->dynticks);
 
-	return snap;
+	return snap & ~RCU_DYNTICK_CTRL_MASK;
 }
 
 /*
@@ -335,7 +353,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
  */
 static bool rcu_dynticks_in_eqs(int snap)
 {
-	return !(snap & 0x1);
+	return !(snap & RCU_DYNTICK_CTRL_CTR);
 }
 
 /*
@@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
 static void rcu_dynticks_momentary_idle(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
-	int special = atomic_add_return(2, &rdtp->dynticks);
+	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+					&rdtp->dynticks);
 
 	/* It is illegal to call this from idle state. */
-	WARN_ON_ONCE(!(special & 0x1));
+	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state.  Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_eqs_special_set(int cpu)
+{
+	int old;
+	int new;
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	do {
+		old = atomic_read(&rdtp->dynticks);
+		if (old & RCU_DYNTICK_CTRL_CTR)
+			return false;
+		new = old | RCU_DYNTICK_CTRL_MASK;
+	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+	return true;
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..7dcdd59d894c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_eqs_special_set(int cpu);
 
 #ifdef CONFIG_RCU_BOOST
 DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);


* Re: task isolation discussion at Linux Plumbers
  2016-11-09 17:38       ` Paul E. McKenney
@ 2016-11-09 18:57         ` Will Deacon
  2016-11-09 19:11           ` Paul E. McKenney
  2016-11-10  1:44         ` Andy Lutomirski
  1 sibling, 1 reply; 80+ messages in thread
From: Will Deacon @ 2016-11-09 18:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

Hi Paul,

Just a couple of comments, but they may be more suited to Andy.

On Wed, Nov 09, 2016 at 09:38:08AM -0800, Paul E. McKenney wrote:
> @@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
>  static void rcu_dynticks_momentary_idle(void)
>  {
>  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> -	int special = atomic_add_return(2, &rdtp->dynticks);
> +	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
> +					&rdtp->dynticks);
>  
>  	/* It is illegal to call this from idle state. */
> -	WARN_ON_ONCE(!(special & 0x1));
> +	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
> +}
> +
> +/*
> + * Set the special (bottom) bit of the specified CPU so that it
> + * will take special action (such as flushing its TLB) on the
> + * next exit from an extended quiescent state.  Returns true if
> + * the bit was successfully set, or false if the CPU was not in
> + * an extended quiescent state.
> + */

Given that TLB maintenance on arm is handled in hardware (no need for IPI),
I'd like to avoid this work if at all possible. However, without seeing the
call site I can't tell if it's optional.

> +bool rcu_eqs_special_set(int cpu)
> +{
> +	int old;
> +	int new;
> +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +
> +	do {
> +		old = atomic_read(&rdtp->dynticks);
> +		if (old & RCU_DYNTICK_CTRL_CTR)
> +			return false;
> +		new = old | RCU_DYNTICK_CTRL_MASK;
> +	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> +	return true;
>  }

Can this be a cmpxchg_relaxed? What is it attempting to order?

Will


* Re: task isolation discussion at Linux Plumbers
  2016-11-09 18:57         ` Will Deacon
@ 2016-11-09 19:11           ` Paul E. McKenney
  0 siblings, 0 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-11-09 19:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
	Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 09, 2016 at 06:57:43PM +0000, Will Deacon wrote:
> Hi Paul,
> 
> Just a couple of comments, but they may be more suited to Andy.
> 
> On Wed, Nov 09, 2016 at 09:38:08AM -0800, Paul E. McKenney wrote:
> > @@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
> >  static void rcu_dynticks_momentary_idle(void)
> >  {
> >  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > -	int special = atomic_add_return(2, &rdtp->dynticks);
> > +	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
> > +					&rdtp->dynticks);
> >  
> >  	/* It is illegal to call this from idle state. */
> > -	WARN_ON_ONCE(!(special & 0x1));
> > +	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
> > +}
> > +
> > +/*
> > + * Set the special (bottom) bit of the specified CPU so that it
> > + * will take special action (such as flushing its TLB) on the
> > + * next exit from an extended quiescent state.  Returns true if
> > + * the bit was successfully set, or false if the CPU was not in
> > + * an extended quiescent state.
> > + */
> 
> Given that TLB maintenance on arm is handled in hardware (no need for IPI),
> I'd like to avoid this work if at all possible. However, without seeing the
> call site I can't tell if it's optional.

For this, I must defer to Andy.

> > +bool rcu_eqs_special_set(int cpu)
> > +{
> > +	int old;
> > +	int new;
> > +	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +
> > +	do {
> > +		old = atomic_read(&rdtp->dynticks);
> > +		if (old & RCU_DYNTICK_CTRL_CTR)
> > +			return false;
> > +		new = old | RCU_DYNTICK_CTRL_MASK;
> > +	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> > +	return true;
> >  }
> 
> Can this be a cmpxchg_relaxed? What is it attempting to order?

It is attempting to order my paranoia.  ;-)

If Andy shows me that less ordering is possible, I can weaken it.

							Thanx, Paul


* Re: task isolation discussion at Linux Plumbers
  2016-11-09 17:38       ` Paul E. McKenney
  2016-11-09 18:57         ` Will Deacon
@ 2016-11-10  1:44         ` Andy Lutomirski
  2016-11-10  4:52           ` Paul E. McKenney
  1 sibling, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-11-10  1:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:

Are you planning on changing rcu_nmi_enter()?  It would make it easier
to figure out how they interact if I could see the code.

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index dbf20b058f48..342c8ee402d6 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c


>  /*
> @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
>  static void rcu_dynticks_eqs_exit(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> +       int seq;
>
>         /*
> -        * CPUs seeing atomic_inc() must see prior idle sojourns,
> +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
>          * and we also must force ordering with the next RCU read-side
>          * critical section.
>          */
> -       smp_mb__before_atomic(); /* See above. */
> -       atomic_inc(&rdtp->dynticks);
> -       smp_mb__after_atomic(); /* See above. */
> +       seq = atomic_inc_return(&rdtp->dynticks);
>         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> -                    !(atomic_read(&rdtp->dynticks) & 0x1));
> +                    !(seq & RCU_DYNTICK_CTRL_CTR));

I think there's still a race here.  Suppose we're running this code on
cpu n and...

> +       if (seq & RCU_DYNTICK_CTRL_MASK) {
> +               rcu_eqs_special_exit();
> +               /* Prefer duplicate flushes to losing a flush. */
> +               smp_mb__before_atomic(); /* NMI safety. */

... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.

That CPU expects that we will flush prior to continuing, but we won't.
Admittedly it's highly unlikely that any stale TLB entries would be
created yet, but nothing rules it out.

> +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> +       }

Maybe the way to handle it is something like:

this_cpu_write(rcu_nmi_needs_eqs_special, 1);
barrier();

/* NMI here will call rcu_eqs_special_exit() regardless of the value
in dynticks */

atomic_and(...);
smp_mb__after_atomic();
rcu_eqs_special_exit();

barrier();
this_cpu_write(rcu_nmi_needs_eqs_special, 0);


Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
bit is set *or* rcu_nmi_needs_eqs_special is set.

Does that make sense?

--Andy


* Re: task isolation discussion at Linux Plumbers
  2016-11-10  1:44         ` Andy Lutomirski
@ 2016-11-10  4:52           ` Paul E. McKenney
  2016-11-10  5:10             ` Paul E. McKenney
  2016-11-11 17:00             ` Andy Lutomirski
  0 siblings, 2 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-11-10  4:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> 
> Are you planning on changing rcu_nmi_enter()?  It would make it easier
> to figure out how they interact if I could see the code.

It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
consolidation patches.

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index dbf20b058f48..342c8ee402d6 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> 
> 
> >  /*
> > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
> >  static void rcu_dynticks_eqs_exit(void)
> >  {
> >         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > +       int seq;
> >
> >         /*
> > -        * CPUs seeing atomic_inc() must see prior idle sojourns,
> > +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> >          * and we also must force ordering with the next RCU read-side
> >          * critical section.
> >          */
> > -       smp_mb__before_atomic(); /* See above. */
> > -       atomic_inc(&rdtp->dynticks);
> > -       smp_mb__after_atomic(); /* See above. */
> > +       seq = atomic_inc_return(&rdtp->dynticks);
> >         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > -                    !(atomic_read(&rdtp->dynticks) & 0x1));
> > +                    !(seq & RCU_DYNTICK_CTRL_CTR));
> 
> I think there's still a race here.  Suppose we're running this code on
> cpu n and...
> 
> > +       if (seq & RCU_DYNTICK_CTRL_MASK) {
> > +               rcu_eqs_special_exit();
> > +               /* Prefer duplicate flushes to losing a flush. */
> > +               smp_mb__before_atomic(); /* NMI safety. */
> 
> ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.

But then rcu_eqs_special_set() will return false because we already
exited the extended quiescent state at the atomic_inc_return() above.

That should tell the caller to send an IPI.

> That CPU expects that we will flush prior to continuing, but we won't.
> Admittedly it's highly unlikely that any stale TLB entries would be
> created yet, but nothing rules it out.

That said, 0day is having some heartburn from this, so I must have broken
something somewhere.  My own tests of course complete just fine...

> > +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > +       }
> 
> Maybe the way to handle it is something like:
> 
> this_cpu_write(rcu_nmi_needs_eqs_special, 1);
> barrier();
> 
> /* NMI here will call rcu_eqs_special_exit() regardless of the value
> in dynticks */
> 
> atomic_and(...);
> smp_mb__after_atomic();
> rcu_eqs_special_exit();
> 
> barrier();
> this_cpu_write(rcu_nmi_needs_eqs_special, 0);
> 
> 
> Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
> bit is set *or* rcu_nmi_needs_eqs_special is set.
> 
> Does that make sense?

I believe that rcu_eqs_special_set() returning false covers this, but
could easily be missing something.

							Thanx, Paul


* Re: task isolation discussion at Linux Plumbers
  2016-11-10  4:52           ` Paul E. McKenney
@ 2016-11-10  5:10             ` Paul E. McKenney
  2016-11-11 17:00             ` Andy Lutomirski
  1 sibling, 0 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-11-10  5:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 09, 2016 at 08:52:13PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
> > On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > Are you planning on changing rcu_nmi_enter()?  It would make it easier
> > to figure out how they interact if I could see the code.
> 
> It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
> consolidation patches.
> 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index dbf20b058f48..342c8ee402d6 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > 
> > 
> > >  /*
> > > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
> > >  static void rcu_dynticks_eqs_exit(void)
> > >  {
> > >         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > +       int seq;
> > >
> > >         /*
> > > -        * CPUs seeing atomic_inc() must see prior idle sojourns,
> > > +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> > >          * and we also must force ordering with the next RCU read-side
> > >          * critical section.
> > >          */
> > > -       smp_mb__before_atomic(); /* See above. */
> > > -       atomic_inc(&rdtp->dynticks);
> > > -       smp_mb__after_atomic(); /* See above. */
> > > +       seq = atomic_inc_return(&rdtp->dynticks);
> > >         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > > -                    !(atomic_read(&rdtp->dynticks) & 0x1));
> > > +                    !(seq & RCU_DYNTICK_CTRL_CTR));
> > 
> > I think there's still a race here.  Suppose we're running this code on
> > cpu n and...
> > 
> > > +       if (seq & RCU_DYNTICK_CTRL_MASK) {
> > > +               rcu_eqs_special_exit();
> > > +               /* Prefer duplicate flushes to losing a flush. */
> > > +               smp_mb__before_atomic(); /* NMI safety. */
> > 
> > ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.
> 
> But then rcu_eqs_special_set() will return false because we already
> exited the extended quiescent state at the atomic_inc_return() above.
> 
> That should tell the caller to send an IPI.
> 
> > That CPU expects that we will flush prior to continuing, but we won't.
> > Admittedly it's highly unlikely that any stale TLB entries would be
> > created yet, but nothing rules it out.
> 
> That said, 0day is having some heartburn from this, so I must have broken
> something somewhere.  My own tests of course complete just fine...
> 
> > > +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > > +       }
> > 
> > Maybe the way to handle it is something like:
> > 
> > this_cpu_write(rcu_nmi_needs_eqs_special, 1);
> > barrier();
> > 
> > /* NMI here will call rcu_eqs_special_exit() regardless of the value
> > in dynticks */
> > 
> > atomic_and(...);
> > smp_mb__after_atomic();
> > rcu_eqs_special_exit();
> > 
> > barrier();
> > this_cpu_write(rcu_nmi_needs_eqs_special, 0);
> > 
> > 
> > Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
> > bit is set *or* rcu_nmi_needs_eqs_special is set.
> > 
> > Does that make sense?
> 
> I believe that rcu_eqs_special_set() returning false covers this, but
> could easily be missing something.

And fixing a couple of stupid errors might help.  Probably more where
those came from...

							Thanx, Paul

------------------------------------------------------------------------

commit 0ea13930e9a89b1741897f5af308692eab08d9e8
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Nov 8 14:25:21 2016 -0800

    rcu: Maintain special bits at bottom of ->dynticks counter
    
    Currently, IPIs are used to force other CPUs to invalidate their TLBs
    in response to a kernel virtual-memory mapping change.  This works, but
    degrades both battery lifetime (for idle CPUs) and real-time response
    (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
    the fact that CPUs executing in usermode are unaffected by stale kernel
    mappings.  It would be better to cause a CPU executing in usermode to
    wait until it is entering kernel mode to do the flush, first to avoid
    interrupting usermode tasks and second to handle multiple flush requests
    with a single flush in the case of a long-running user task.
    
    This commit therefore reserves a bit at the bottom of the ->dynticks
    counter, which is checked upon exit from extended quiescent states.
    If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
    invoked, which, if not supplied, is an empty single-pass do-while loop.
    If this bottom bit is set on -entry- to an extended quiescent state,
    then a WARN_ON_ONCE() triggers.
    
    This bottom bit may be set using a new rcu_eqs_special_set() function,
    which returns true if the bit was set, or false if the CPU turned
    out to not be in an extended quiescent state.  Please note that this
    function refuses to set the bit for a non-nohz_full CPU when that CPU
    is executing in usermode because usermode execution is tracked by RCU
    as a dyntick-idle extended quiescent state only for nohz_full CPUs.
    
    Reported-by: Andy Lutomirski <luto@amacapital.net>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..7232d199a81c 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 	return 0;
 }
 
+static inline bool rcu_eqs_special_set(int cpu)
+{
+	return false;  /* Never flag non-existent other CPUs! */
+}
+
 static inline unsigned long get_state_synchronize_rcu(void)
 {
 	return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..c1e3d4a333b2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 };
 
 /*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control.  Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_eqs_special_exit
+#define rcu_eqs_special_exit() do { } while (0)
+#endif
+
+/*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state.
  */
 static void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior RCU read-side critical
-	 * sections, and we also must force ordering with the next idle
-	 * sojourn.
+	 * CPUs seeing atomic_inc_return() must see prior RCU read-side
+	 * critical sections, and we also must force ordering with the
+	 * next idle sojourn.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
+	/* Better be in an extended quiescent state! */
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+		     (seq & RCU_DYNTICK_CTRL_CTR));
+	/* Better not have special action (TLB flush) pending! */
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     atomic_read(&rdtp->dynticks) & 0x1);
+		     (seq & RCU_DYNTICK_CTRL_MASK));
 }
 
 /*
@@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
 static void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int seq;
 
 	/*
-	 * CPUs seeing atomic_inc() must see prior idle sojourns,
+	 * CPUs seeing atomic_inc_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
-	smp_mb__before_atomic(); /* See above. */
-	atomic_inc(&rdtp->dynticks);
-	smp_mb__after_atomic(); /* See above. */
+	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
-		     !(atomic_read(&rdtp->dynticks) & 0x1));
+		     !(seq & RCU_DYNTICK_CTRL_CTR));
+	if (seq & RCU_DYNTICK_CTRL_MASK) {
+		rcu_eqs_special_exit();
+		/* Prefer duplicate flushes to losing a flush. */
+		smp_mb__before_atomic(); /* NMI safety. */
+		atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+	}
 }
 
 /*
@@ -326,7 +344,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
 {
 	int snap = atomic_add_return(0, &rdtp->dynticks);
 
-	return snap;
+	return snap & ~RCU_DYNTICK_CTRL_MASK;
 }
 
 /*
@@ -335,7 +353,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
  */
 static bool rcu_dynticks_in_eqs(int snap)
 {
-	return !(snap & 0x1);
+	return !(snap & RCU_DYNTICK_CTRL_CTR);
 }
 
 /*
@@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
 static void rcu_dynticks_momentary_idle(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
-	int special = atomic_add_return(2, &rdtp->dynticks);
+	int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+					&rdtp->dynticks);
 
 	/* It is illegal to call this from idle state. */
-	WARN_ON_ONCE(!(special & 0x1));
+	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state.  Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_eqs_special_set(int cpu)
+{
+	int old;
+	int new;
+	struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+	do {
+		old = atomic_read(&rdtp->dynticks);
+		if (old & RCU_DYNTICK_CTRL_CTR)
+			return false;
+		new = old | RCU_DYNTICK_CTRL_MASK;
+	} while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+	return true;
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..7dcdd59d894c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
 #endif /* #ifdef CONFIG_PREEMPT_RCU */
 
 int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_eqs_special_set(int cpu);
 
 #ifdef CONFIG_RCU_BOOST
 DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: task isolation discussion at Linux Plumbers
  2016-11-10  4:52           ` Paul E. McKenney
  2016-11-10  5:10             ` Paul E. McKenney
@ 2016-11-11 17:00             ` Andy Lutomirski
  1 sibling, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-11-11 17:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel

On Wed, Nov 9, 2016 at 8:52 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>>
>> Are you planning on changing rcu_nmi_enter()?  It would make it easier
>> to figure out how they interact if I could see the code.
>
> It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
> consolidation patches.
>
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index dbf20b058f48..342c8ee402d6 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>>
>>
>> >  /*
>> > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
>> >  static void rcu_dynticks_eqs_exit(void)
>> >  {
>> >         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > +       int seq;
>> >
>> >         /*
>> > -        * CPUs seeing atomic_inc() must see prior idle sojourns,
>> > +        * CPUs seeing atomic_inc_return() must see prior idle sojourns,
>> >          * and we also must force ordering with the next RCU read-side
>> >          * critical section.
>> >          */
>> > -       smp_mb__before_atomic(); /* See above. */
>> > -       atomic_inc(&rdtp->dynticks);
>> > -       smp_mb__after_atomic(); /* See above. */
>> > +       seq = atomic_inc_return(&rdtp->dynticks);
>> >         WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
>> > -                    !(atomic_read(&rdtp->dynticks) & 0x1));
>> > +                    !(seq & RCU_DYNTICK_CTRL_CTR));
>>
>> I think there's still a race here.  Suppose we're running this code on
>> cpu n and...
>>
>> > +       if (seq & RCU_DYNTICK_CTRL_MASK) {
>> > +               rcu_eqs_special_exit();
>> > +               /* Prefer duplicate flushes to losing a flush. */
>> > +               smp_mb__before_atomic(); /* NMI safety. */
>>
>> ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.
>
> But then rcu_eqs_special_set() will return false because we already
> exited the extended quiescent state at the atomic_inc_return() above.
>
> That should tell the caller to send an IPI.
>
>> That CPU expects that we will flush prior to continuing, but we won't.
>> Admittedly it's highly unlikely that any stale TLB entries would be
>> created yet, but nothing rules it out.
>
> That said, 0day is having some heartburn from this, so I must have broken
> something somewhere.  My own tests of course complete just fine...
>
>> > +               atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
>> > +       }
>>
>> Maybe the way to handle it is something like:
>>
>> this_cpu_write(rcu_nmi_needs_eqs_special, 1);
>> barrier();
>>
>> /* NMI here will call rcu_eqs_special_exit() regardless of the value
>> in dynticks */
>>
>> atomic_and(...);
>> smp_mb__after_atomic();
>> rcu_eqs_special_exit();
>>
>> barrier();
>> this_cpu_write(rcu_nmi_needs_eqs_special, 0);
>>
>>
>> Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
>> bit is set *or* rcu_nmi_needs_eqs_special is set.
>>
>> Does that make sense?
>
> I believe that rcu_eqs_special_set() returning false covers this, but
> could easily be missing something.

I think you're right.  I'll stare at it some more when I do the actual
TLB flush patch.

--Andy


* Re: task isolation discussion at Linux Plumbers
  2016-11-07 16:55   ` Thomas Gleixner
  2016-11-07 18:36     ` Thomas Gleixner
@ 2016-11-11 20:54     ` Luiz Capitulino
  1 sibling, 0 replies; 80+ messages in thread
From: Luiz Capitulino @ 2016-11-11 20:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, Francis Giraldeau, Andi Kleen, Arnd Bergmann,
	linux-kernel

On Mon, 7 Nov 2016 17:55:47 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sat, 5 Nov 2016, Chris Metcalf wrote:
> > == Remote statistics ==
> > 
> > We discussed the possibility of remote statistics gathering, i.e. load
> > average etc.  The idea would be that we could have housekeeping
> > core(s) periodically iterate over the nohz cores to load their rq
> > remotely and do update_current etc.  Presumably it should be possible
> > for a single housekeeping core to handle doing this for all the
> > nohz_full cores, as we only need to do it quite infrequently.
> > 
> > Thomas suggested that this might be the last remaining thing that
> > needed to be done to allow disabling the current behavior of falling
> > back to a 1 Hz clock in nohz_full.
> > 
> > I believe Thomas said he had a patch to do this already.  
> 
> No, Rik was working on that.

Rik's series made remote tick sampling possible: calling
account_process_tick(cpu) from a housekeeping CPU, so that tick
accounting is done for a remote "cpu".  The series was intended
to reduce the overhead on nohz_full CPUs.

However, to get rid of the 1Hz tick, we need to do the same thing
for scheduler_tick().  I'm not sure whether this has been attempted
yet, or whether it's possible at all.

> > == Remote LRU cache drain ==
> > 
> > One of the issues with task isolation currently is that the LRU cache
> > drain must be done prior to entering userspace, but it requires
> > interrupts enabled and thus can't be done atomically.  My previous
> > patch series have handled this by checking with interrupts disabled,
> > but then looping around with interrupts enabled to try to drain the
> > LRU pagevecs.  Experimentally this works, but it's not provable that
> > it terminates, which is worrisome.  Andy suggested adding a percpu
> > flag to disable creation of deferred work like LRU cache pages.
> > 
> > Thomas suggested using an RT "local lock" to guard the LRU cache
> > flush; he is planning on bringing the concept to mainline in any case.
> > However, after some discussion we converged on simply using a spinlock
> > to guard the appropriate resources.  As a result, the
> > lru_add_drain_all() code that currently queues work on each remote cpu
> > to drain it, can instead simply acquire the lock and drain it remotely.
> > This means that a task isolation task no longer needs to worry about
> > being interrupted by SMP function call IPIs, so we don't have to deal
> > with this in the task isolation framework any more.
> > 
> > I don't recall anyone else volunteering to tackle this, so I will plan
> > to look at it.  The patch to do that should be orthogonal to the
> > revised task isolation patch series.  
> 
> I offered to clean up the patch from RT. I'll do that in the next days.

Yes, the RT kernel got a patch that fixes this.  I wonder if the same
idea could work for vm_stat (that is, gathering those stats from
housekeeping CPUs rather than having to queue deferrable work to do it).


* Re: task isolation discussion at Linux Plumbers
  2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
                     ` (3 preceding siblings ...)
  2016-11-09 11:07   ` Frederic Weisbecker
@ 2016-12-19 14:37   ` Paul E. McKenney
  4 siblings, 0 replies; 80+ messages in thread
From: Paul E. McKenney @ 2016-12-19 14:37 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Andi Kleen, Arnd Bergmann, linux-kernel,
	boqun.feng

On Sat, Nov 05, 2016 at 12:04:45AM -0400, Chris Metcalf wrote:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)

Which reminds me...

One spirited side discussion at Santa Fe involved RCU's rcu_node tree
and CPU numbering.  Several people insisted that RCU should remap CPU
numbers to allow for interesting CPU-numbering schemes.  I resisted
this added complexity rather strenuously, but eventually said that I
would consider adding such complexity if someone provided me with valid
system-level performance data showing a need for this sort of remapping.

I haven't heard anything since, so I figured that I should follow up.
How are things going collecting this data?

							Thanx, Paul


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
  2016-08-29 16:33   ` Peter Zijlstra
@ 2017-02-02 16:13   ` Eugene Syromiatnikov
  2017-02-02 18:12     ` Chris Metcalf
  1 sibling, 1 reply; 80+ messages in thread
From: Eugene Syromiatnikov @ 2017-02-02 16:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chris Metcalf, Frederic Weisbecker

>  	case PR_GET_FP_MODE:
>  		error = GET_FP_MODE(me);
>  		break;
> +#ifdef CONFIG_TASK_ISOLATION
> +	case PR_SET_TASK_ISOLATION:
> +		error = task_isolation_set(arg2);
> +		break;
> +	case PR_GET_TASK_ISOLATION:
> +		error = me->task_isolation_flags;
> +		break;
> +#endif
>  	default:
>  		error = -EINVAL;
>  		break;

It is not a very good idea to ignore the values of unused arguments: it
prevents their future use, since user space can pass garbage values
here today.  Check out the code for newer prctl handlers, like
PR_SET_NO_NEW_PRIVS, PR_SET_THP_DISABLE, or PR_MPX_ENABLE_MANAGEMENT
(PR_[SG]ET_FP_MODE is an unfortunate recent omission).

The other thing is the use of #ifdef's, which is generally avoided
here.  Also, the man-pages patch describing the new prctl calls is
missing.

Please take a look at the following patch, which attempts to fix the
aforementioned issues (it is assumed that it can be squashed into the
current one).

---
 include/linux/isolation.h | 12 ++++++++++++
 kernel/sys.c              | 16 ++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 02728b1..ae24cad 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -9,6 +9,13 @@
 
 #ifdef CONFIG_TASK_ISOLATION
 
+#ifndef GET_TASK_ISOLATION
+# define GET_TASK_ISOLATION(me)		task_isolation_get(me)
+#endif
+#ifndef SET_TASK_ISOLATION
+# define SET_TASK_ISOLATION(me, a)	task_isolation_set(a)
+#endif
+
 /* cpus that are configured to support task isolation */
 extern cpumask_var_t task_isolation_map;
 
@@ -22,6 +29,11 @@ static inline bool task_isolation_possible(int cpu)
 
 extern int task_isolation_set(unsigned int flags);
 
+static inline unsigned int task_isolation_get(struct task_struct *ts)
+{
+	return ts->task_isolation_flags;
+}
+
 extern bool task_isolation_ready(void);
 extern void task_isolation_enter(void);
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 3de863f..96e0873 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -104,6 +104,12 @@
 #ifndef SET_FP_MODE
 # define SET_FP_MODE(a,b)	(-EINVAL)
 #endif
+#ifndef GET_TASK_ISOLATION
+# define GET_TASK_ISOLATION(me)		(-EINVAL)
+#endif
+#ifndef SET_TASK_ISOLATION
+# define SET_TASK_ISOLATION(me, a)	(-EINVAL)
+#endif
 
 /*
  * this is where the system-wide overflow UID and GID are defined, for
@@ -2262,14 +2268,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
-#ifdef CONFIG_TASK_ISOLATION
 	case PR_SET_TASK_ISOLATION:
-		error = task_isolation_set(arg2);
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = SET_TASK_ISOLATION(me, arg2);
 		break;
 	case PR_GET_TASK_ISOLATION:
-		error = me->task_isolation_flags;
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = GET_TASK_ISOLATION(me);
 		break;
-#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.9.3


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2017-02-02 16:13   ` Eugene Syromiatnikov
@ 2017-02-02 18:12     ` Chris Metcalf
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Metcalf @ 2017-02-02 18:12 UTC (permalink / raw)
  To: Eugene Syromiatnikov, linux-kernel; +Cc: Frederic Weisbecker

On 2/2/2017 11:13 AM, Eugene Syromiatnikov wrote:
>>   	case PR_GET_FP_MODE:
>>   		error = GET_FP_MODE(me);
>>   		break;
>> +#ifdef CONFIG_TASK_ISOLATION
>> +	case PR_SET_TASK_ISOLATION:
>> +		error = task_isolation_set(arg2);
>> +		break;
>> +	case PR_GET_TASK_ISOLATION:
>> +		error = me->task_isolation_flags;
>> +		break;
>> +#endif
>>   	default:
>>   		error = -EINVAL;
>>   		break;
> It is not a very good idea to ignore the values of unused arguments: it
> prevents their future use, since user space can pass garbage values
> here today.  Check out the code for newer prctl handlers, like
> PR_SET_NO_NEW_PRIVS, PR_SET_THP_DISABLE, or PR_MPX_ENABLE_MANAGEMENT
> (PR_[SG]ET_FP_MODE is an unfortunate recent omission).
>
> The other thing is the use of #ifdef's, which is generally avoided
> here.  Also, the man-pages patch describing the new prctl calls is
> missing.

Thanks, I appreciate the feedback.  I'll fold this into the next spin of the series!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


end of thread, other threads:[~2017-02-02 18:12 UTC | newest]

Thread overview: 80+ messages
2016-08-16 21:19 [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
2016-08-29 16:33   ` Peter Zijlstra
2016-08-29 16:40     ` Chris Metcalf
2016-08-29 16:48       ` Peter Zijlstra
2016-08-29 16:53         ` Chris Metcalf
2016-08-30  7:59           ` Peter Zijlstra
2016-08-30  7:58       ` Peter Zijlstra
2016-08-30 15:32         ` Chris Metcalf
2016-08-30 16:30           ` Andy Lutomirski
2016-08-30 17:02             ` Chris Metcalf
2016-08-30 18:43               ` Andy Lutomirski
2016-08-30 19:37                 ` Chris Metcalf
2016-08-30 19:50                   ` Andy Lutomirski
2016-09-02 14:04                     ` Chris Metcalf
2016-09-02 17:28                       ` Andy Lutomirski
2016-09-09 17:40                         ` Chris Metcalf
2016-09-12 17:41                           ` Andy Lutomirski
2016-09-12 19:25                             ` Chris Metcalf
2016-09-27 14:22                         ` Frederic Weisbecker
2016-09-27 14:39                           ` Peter Zijlstra
2016-09-27 14:51                             ` Frederic Weisbecker
2016-09-27 14:48                           ` Paul E. McKenney
2016-09-30 16:59                 ` Chris Metcalf
2016-09-01 10:06           ` Peter Zijlstra
2016-09-02 14:03             ` Chris Metcalf
2016-09-02 16:40               ` Peter Zijlstra
2017-02-02 16:13   ` Eugene Syromiatnikov
2017-02-02 18:12     ` Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 05/13] task_isolation: track asynchronous interrupts Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 06/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-08-30 21:46   ` Andy Lutomirski
2016-08-16 21:19 ` [PATCH v15 07/13] arm64: factor work_pending state machine to C Chris Metcalf
2016-08-17  8:05   ` Will Deacon
2016-08-16 21:19 ` [PATCH v15 08/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-08-26 16:25   ` Catalin Marinas
2016-08-16 21:19 ` [PATCH v15 09/13] arch/tile: " Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 10/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 12/13] task_isolation: add user-settable notification signal Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 13/13] task_isolation self test Chris Metcalf
2016-08-17 19:37 ` [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode) Christoph Lameter
2016-08-20  1:42   ` Chris Metcalf
2016-09-28 13:16   ` Frederic Weisbecker
2016-08-29 16:27 ` Ping: [PATCH v15 00/13] support "task_isolation" mode Chris Metcalf
2016-09-07 21:11   ` Francis Giraldeau
2016-09-07 21:39     ` Francis Giraldeau
2016-09-08 16:21     ` Francis Giraldeau
2016-09-12 16:01     ` Chris Metcalf
2016-09-12 16:14       ` Peter Zijlstra
2016-09-12 21:15         ` Rafael J. Wysocki
2016-09-13  0:05           ` Rafael J. Wysocki
2016-09-13 16:00             ` Francis Giraldeau
2016-09-13  0:20       ` Francis Giraldeau
2016-09-13 16:12         ` Chris Metcalf
2016-09-27 14:49         ` Frederic Weisbecker
2016-09-27 14:35   ` Frederic Weisbecker
2016-09-30 17:07     ` Chris Metcalf
2016-11-05  4:04 ` task isolation discussion at Linux Plumbers Chris Metcalf
2016-11-05 16:05   ` Christoph Lameter
2016-11-07 16:55   ` Thomas Gleixner
2016-11-07 18:36     ` Thomas Gleixner
2016-11-07 19:12       ` Rik van Riel
2016-11-07 19:16         ` Will Deacon
2016-11-07 19:18           ` Rik van Riel
2016-11-11 20:54     ` Luiz Capitulino
2016-11-09  1:40   ` Paul E. McKenney
2016-11-09 11:14     ` Andy Lutomirski
2016-11-09 17:38       ` Paul E. McKenney
2016-11-09 18:57         ` Will Deacon
2016-11-09 19:11           ` Paul E. McKenney
2016-11-10  1:44         ` Andy Lutomirski
2016-11-10  4:52           ` Paul E. McKenney
2016-11-10  5:10             ` Paul E. McKenney
2016-11-11 17:00             ` Andy Lutomirski
2016-11-09 11:07   ` Frederic Weisbecker
2016-12-19 14:37   ` Paul E. McKenney
