linux-kernel.vger.kernel.org archive mirror
* [PATCH v16 00/13] support "task_isolation" mode
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Here, finally, is a new spin of the task isolation work (v16), with
changes based on the issues that were raised at last year's Linux
Plumbers Conference and in the email discussion that followed.

This version of the patch series cleans up a number of areas that were
a little dodgy in the previous patch series.

- It no longer loops in the final code that prepares to return to
  userspace; instead, it sets things up in the prctl() and then
  validates when preparing to return to userspace, adjusting the
  syscall return value to -EAGAIN at that point if something doesn't
  line up quite correctly (see the usage sketch after this list).

- We no longer support the NOSIG mode that let you freely call into
  the kernel multiple times while in task isolation.  This was always
  a little odd, since you really should be in sufficient control of
  task isolation code that you can explicitly stop isolation with a
  "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
  else.  It also made the semantics of migrating the task confusing.
  More importantly, removing that support means that the only path
  that sets up task isolation is the return from prctl(), which allows
  us to make the simplification above.

- We no longer try to signal the task isolation process from a remote
  core when we detect that we are about to violate its isolation.
  Instead, we just print a message there (and optionally dump stack),
  and rely on the eventual interrupt on the core itself to trigger the
  signal.  We are always in a safe context to generate a signal when
  we enter the kernel, unlike when we are deep in a call stack sending
  an SMP IPI or whatever.

- We notice the case of an unstable scheduler clock and return
  EINVAL rather than spinning forever with EAGAIN (suggestion from
  Francis Giraldeau).

- The prctl() call requires zeros for arg3/4/5 (suggestion from
  Eugene Syromiatnikov).
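
For reference, the resulting usage pattern looks something like the
sketch below.  This is a minimal userspace illustration of the points
above, not code from the series: the helper names are mine, and it
assumes the kernel was booted with suitable "nohz_full=" / "isolcpus="
arguments and that the task is already affinitized to a single
isolated cpu.

  #include <sys/prctl.h>
  #include <signal.h>
  #include <errno.h>

  #ifndef PR_TASK_ISOLATION		/* from the patched <linux/prctl.h> */
  # define PR_TASK_ISOLATION		48
  # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
  # define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
  #endif

  static int enter_isolation(void)
  {
  	int rc;

  	do {
  		/* arg3/4/5 must be zero; retry on transient -EAGAIN. */
  		rc = prctl(PR_TASK_ISOLATION,
  			   PR_TASK_ISOLATION_ENABLE |
  			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
  			   0, 0, 0);
  	} while (rc != 0 && errno == EAGAIN);

  	return rc;	/* -1 with errno == EINVAL is permanent */
  }

  static void leave_isolation(void)
  {
  	/* Explicitly stop isolation before using the kernel again. */
  	prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
  }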

The kernel internal isolation API is also now cleaner, and I have
included kerneldoc APIs for all the interfaces so that it should be
easier to port it to additional architectures; in fact looking at
include/linux/isolation.h is a good place to start understanding the
overall patch set.

I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
for x86, since this version is sufficiently different to merit
re-review and re-testing.

Note that this is not rebased on top of Frederic's recent housekeeping
patch series, although it is largely orthogonal right now.  After
Frederic's patch series lands, task isolation will be enabled with
"isolcpus=nohz,domain,CPUS".  We could add some shorthand for that
("isolcpus=full,CPUS"?) or just use it as-is.

The previous (v15) patch series is here:

https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@mellanox.com

This patch series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Some folks raised some good points at the LPC discussion and then in
email discussions that followed.  Rather than trying to respond to
everyone in a flurry of emails, I'll answer some questions here:


Why not just instrument user_exit() to raise the isolation-lost signal?

Andy pointed me in this direction.  The advantage is that you catch
*everything*, by definition.  There is a hook that can do this in the
current patch set, but you have to #define DEBUG_TASK_ISOLATION
manually to take advantage of it, because as written it has two issues:

1. You can't actually turn isolation off with prctl(PR_TASK_ISOLATION,0)
   because the user_exit hook kills you first.
2. You lose the ability to get much better diagnostics by waiting
   until you are further into kernel entry and know what you're doing.

You could work around #2 in several ways, but #1 is harder.  I looked
at x86 for a while, and although you could imagine this, you really
want to generate a lost-isolation signal on any syscall that isn't
that exact prctl(), and it's awkward to try to do all of that checking
before user_exit().  Since in any case we do want to have the more
specific probes at the various kernel entry points where we generate
the diagnostics, I felt it wasn't the right approach to enable this
as a compile-time default.  I'm open to discussion on this though!
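
(For a quick experiment, something like "make KCFLAGS=-DDEBUG_TASK_ISOLATION"
should be sufficient to turn the hook on, since the #ifdef lives entirely
in include/linux/isolation.h; I haven't wired up a Kconfig option for it.)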


Can't we do all the exit-to-userspace work with irqs disabled?

In fact, it turns out that you can do lru_add_drain() with irqs
disabled, so that's what we're doing in the patch series now.

However, it doesn't seem possible to do the synchronous cancellation of
the vmstat deferred work with irqs disabled, though if there's a way,
it would be a little cleaner to do that; Christoph?  We can certainly
update the statistics with interrupts disabled via
refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I
just issue the cancellation during the sys_prctl() call, and then if it
isn't synchronized by the time we are exiting to userspace, we just
jam in an EAGAIN and let userspace retry.  In practice, this doesn't
seem to ever happen.


What about using a per-cpu flag to stop doing new deferred work?

Andy also suggested we could structure the code to have the prctl()
set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
data, or lru page cache).  Then, we could flush those structures right
from the sys_prctl() call, and when we were returning to user space,
we'd be confident that there wasn't going to be any new work added.

With the current set of things that we are disabling for task
isolation, though, it didn't seem necessary.  Quiescing the vmstat
shepherd seems like it is generally pretty safe since we will likely
be able to sync up the per-cpu cache and kill the deferred work with
high probability, with no expectation that additional work will show
up.  And since we can flush the LRU page cache with interrupts
disabled, that turns out not to be an issue either.

I could imagine that if we have to deal with some new kind of deferred
work, we might find the per-cpu flag becomes a good solution, but for
now we don't have a good use case for that approach.


How about stopping the dyn tick?

Right now we try to stop it on return to userspace, but if we can't,
we just return EAGAIN to userspace.  In practice, what I see is that
usually the tick stops immediately, but occasionally it doesn't; in
this case I've always seen that nr_running is >1, presumably with some
temporary kernel worker threads, and the user code just needs to call
prctl() until those threads are done.  We could structure things with
a completion that we wait for, which is set by the timer code when it
finally does stop the tick, but this may be overkill, particularly
since we'll only be running this prctl() loop from userspace on cores
where we have no other useful work that we're trying to run anyway.


What about TLB flushing?

We talked about this at Plumbers and some of the email discussion also
was about TLB flushing.  I haven't tried to add it to this patch set,
because I really want to avoid scope creep; in any case, I think I
managed to convince Andy that he was going to work on it himself. :)
Paul McKenney already contributed some framework for such a patch, in
commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
->dynticks counter").

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
that forcing updates on cpustats was the last remaining gating factor;
do we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?


Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  Revert "sched/core: Drop the unused try_get_task_struct() helper
    function"
  Add try_get_task_struct_on_cpu() to scheduler for task isolation
  Add try_stop_full_tick() API for NO_HZ_FULL
  task_isolation: userspace hard isolation from kernel
  Add task isolation hooks to arch-independent code
  arch/x86: enable task isolation functionality
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation self test

Francis Giraldeau (1):
  arch/arm: enable task isolation functionality

 Documentation/admin-guide/kernel-parameters.txt    |   6 +
 arch/arm/Kconfig                                   |   1 +
 arch/arm/include/asm/thread_info.h                 |  10 +-
 arch/arm/kernel/entry-common.S                     |  12 +-
 arch/arm/kernel/ptrace.c                           |  10 +
 arch/arm/kernel/signal.c                           |  10 +-
 arch/arm/kernel/smp.c                              |   4 +
 arch/arm/mm/fault.c                                |   8 +-
 arch/arm64/Kconfig                                 |   1 +
 arch/arm64/include/asm/thread_info.h               |   5 +-
 arch/arm64/kernel/ptrace.c                         |  18 +-
 arch/arm64/kernel/signal.c                         |  10 +-
 arch/arm64/kernel/smp.c                            |   7 +
 arch/arm64/mm/fault.c                              |   5 +
 arch/tile/Kconfig                                  |   1 +
 arch/tile/include/asm/thread_info.h                |   2 +
 arch/tile/kernel/hardwall.c                        |   2 +
 arch/tile/kernel/irq.c                             |   3 +
 arch/tile/kernel/messaging.c                       |   4 +
 arch/tile/kernel/process.c                         |   4 +
 arch/tile/kernel/ptrace.c                          |  10 +
 arch/tile/kernel/single_step.c                     |   7 +
 arch/tile/kernel/smp.c                             |  21 +-
 arch/tile/kernel/time.c                            |   3 +
 arch/tile/kernel/unaligned.c                       |   4 +
 arch/tile/mm/fault.c                               |  13 +-
 arch/tile/mm/homecache.c                           |  11 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |  14 +
 arch/x86/include/asm/apic.h                        |   3 +
 arch/x86/include/asm/thread_info.h                 |   8 +-
 arch/x86/kernel/smp.c                              |   2 +
 arch/x86/kernel/traps.c                            |   3 +
 arch/x86/mm/fault.c                                |   5 +
 drivers/clocksource/arm_arch_timer.c               |   2 +
 include/linux/isolation.h                          | 175 ++++++
 include/linux/sched.h                              |   4 +
 include/linux/sched/task.h                         |   3 +
 include/linux/tick.h                               |   1 +
 include/linux/vmstat.h                             |   4 +
 include/uapi/linux/prctl.h                         |   6 +
 init/Kconfig                                       |  28 +
 kernel/Makefile                                    |   1 +
 kernel/context_tracking.c                          |   2 +
 kernel/exit.c                                      |  13 +
 kernel/irq/irqdesc.c                               |   5 +
 kernel/irq_work.c                                  |   5 +-
 kernel/isolation.c                                 | 401 +++++++++++++
 kernel/sched/core.c                                |  11 +
 kernel/signal.c                                    |   2 +
 kernel/smp.c                                       |   6 +-
 kernel/sys.c                                       |   6 +
 kernel/time/tick-sched.c                           |  18 +
 mm/vmstat.c                                        |  19 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |   6 +
 tools/testing/selftests/task_isolation/config      |   1 +
 tools/testing/selftests/task_isolation/isolation.c | 643 +++++++++++++++++++++
 58 files changed, 1561 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

-- 
2.1.2


* [PATCH v16 01/13] vmstat: add quiet_vmstat_sync function
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with those semantics.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5f1359..e0b504594593 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -264,6 +264,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -366,6 +367,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e72ac97..8ad1b84ca9cf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1837,6 +1837,15 @@ void quiet_vmstat(void)
 }
 
 /*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.1.2


* [PATCH v16 02/13] vmstat: add vmstat_idle function
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-mm,
	linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running and that
the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e0b504594593..80212a952448 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -265,6 +265,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -368,6 +369,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ad1b84ca9cf..8b13a6ca494c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1846,6 +1846,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.1.2


* [PATCH v16 03/13] Revert "sched/core: Drop the unused try_get_task_struct() helper function"
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Davidlohr Bueso, Oleg Nesterov, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Michal Hocko, linux-kernel
  Cc: Chris Metcalf

This reverts commit f11cc0760b8397e0d230122606421b6a96e9f869.
We do need this function for try_get_task_struct_on_cpu().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/sched/task.h |  2 ++
 kernel/exit.c              | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 79a2a744648d..270ff76d43d9 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -96,6 +96,8 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index f2cd53e92147..e2a3e7458d0f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -318,6 +318,19 @@ void rcuwait_wake_up(struct rcuwait *w)
 	rcu_read_unlock();
 }
 
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct task_struct *task;
+
+	rcu_read_lock();
+	task = task_rcu_dereference(ptask);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
 /*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
-- 
2.1.2


* [PATCH v16 04/13] Add try_get_task_struct_on_cpu() to scheduler for task isolation
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Davidlohr Bueso, Oleg Nesterov, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Michal Hocko, linux-kernel
  Cc: Chris Metcalf

Task isolation wants to be able to verify that a remote core is
running an isolated task to determine if it should generate a
diagnostic, and also possibly interrupt it.

This API returns a pointer to the task_struct of the task that
was running on the specified core at the moment of the request;
it uses try_get_task_struct() to increment the ref count on the
returned task_struct so that the caller can examine it even if
the actual remote task has already exited by that point.
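
A usage sketch (illustrative only, not code from this series):

  struct task_struct *p = try_get_task_struct_on_cpu(cpu);

  if (p != NULL) {
  	/* Safe to inspect even if the task has since exited. */
  	pr_info("cpu %d was running %s/%d\n", cpu, p->comm, p->pid);
  	put_task_struct(p);	/* drop the reference we took */
  }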

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/sched/task.h |  1 +
 kernel/sched/core.c        | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 270ff76d43d9..6785db926857 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -97,6 +97,7 @@ static inline void put_task_struct(struct task_struct *t)
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 struct task_struct *try_get_task_struct(struct task_struct **ptask);
+struct task_struct *try_get_task_struct_on_cpu(int cpu);
 
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..2728154057ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -670,6 +670,17 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+/*
+ * Return a pointer to the task_struct for the task that is running on
+ * the specified cpu at the time of the call (note that the task may have
+ * exited by the time the caller inspects the resulting task_struct).
+ * Caller must put_task_struct() with the pointer when finished with it.
+ */
+struct task_struct *try_get_task_struct_on_cpu(int cpu)
+{
+	return try_get_task_struct(&cpu_rq(cpu)->curr);
+}
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
-- 
2.1.2


* [PATCH v16 05/13] Add try_stop_full_tick() API for NO_HZ_FULL
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-kernel
  Cc: Chris Metcalf

This API checks to see if the scheduler tick can be stopped,
and if so, stops it and returns 0; otherwise it returns an error.
This is intended for use with task isolation, where we will want to
be able to stop the tick synchronously when returning to userspace.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/tick.h     |  1 +
 kernel/time/tick-sched.c | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68bf520..078ff2464b00 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -234,6 +234,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern int try_stop_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c5ce64..c026145eba2f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -861,6 +861,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.1.2


* [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
From: Chris Metcalf @ 2017-11-03 17:04 UTC
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	Jonathan Corbet, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
nohz_full and isolcpus.  The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags.  When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced.  If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired.  In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall-checking
code ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting.  Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
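
As a sketch of that recovery style (illustrative only; it reuses the
PR_TASK_ISOLATION macros from the patched <linux/prctl.h> and assumes
<setjmp.h>, <signal.h> and <errno.h>, with SIGUSR1 as the example
notification signal):

  static sigjmp_buf restart_point;

  static void lost_isolation(int sig)
  {
  	/* The kernel already disabled isolation before signaling us. */
  	siglongjmp(restart_point, 1);
  }

  static void run_isolated(void (*fn)(void))
  {
  	signal(SIGUSR1, lost_isolation);
  	sigsetjmp(restart_point, 1);
  	while (prctl(PR_TASK_ISOLATION,
  		     PR_TASK_ISOLATION_ENABLE |
  		     PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0) != 0 &&
  	       errno == EAGAIN)
  		continue;	/* real code should also check for -EINVAL */
  	fn();		/* the code segment intended to run isolated */
  }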

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, tile,
arm, and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   6 +
 include/linux/isolation.h                       | 175 +++++++++++
 include/linux/sched.h                           |   4 +
 include/uapi/linux/prctl.h                      |   6 +
 init/Kconfig                                    |  28 ++
 kernel/Makefile                                 |   1 +
 kernel/context_tracking.c                       |   2 +
 kernel/isolation.c                              | 402 ++++++++++++++++++++++++
 kernel/signal.c                                 |   2 +
 kernel/sys.c                                    |   6 +
 10 files changed, 631 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..aaf278f2cfc3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4025,6 +4025,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..8189a772affd
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,175 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation is enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_interrupt()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+/**
+ * _task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu.  The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened than just reporting only when the remote core is interrupted.
+ *
+ * The @do_interrupt flag, if true, causes the routine to not just print
+ * the diagnostic, but also to generate a reschedule interrupt to the
+ * remote core that is being interrupted.  This is necessary if the remote
+ * interrupt being diagnosed will not otherwise be visible to the remote
+ * core (e.g. a hypervisor service is being invoked on the remote core).
+ * Sending a reschedule will force the core to trigger the isolation signal
+ * and exit isolation mode.
+ *
+ * The task_isolation_remote() macro passes @do_interrupt as false, and the
+ * task_isolation_remote_interrupt() passes the flag as true.
+ */
+extern void _task_isolation_remote(int cpu, bool do_interrupt,
+				   const char *fmt, ...);
+#define task_isolation_remote(cpu, fmt, ...) \
+	_task_isolation_remote(cpu, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_interrupt(cpu, fmt, ...) \
+	_task_isolation_remote(cpu, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remote cpus that are about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void _task_isolation_remote_cpumask(const struct cpumask *mask,
+					   bool do_interrupt,
+					   const char *fmt, ...);
+#define task_isolation_remote_cpumask(cpumask, fmt, ...) \
+	_task_isolation_remote_cpumask(cpumask, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_cpumask_interrupt(cpumask, fmt, ...) \
+	_task_isolation_remote_cpumask(cpumask, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task) do {				\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_interrupt(int cpu,
+						   const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask_interrupt(
+	const struct cpumask *mask, const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a7df4e558c..739d81a44e13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1096,6 +1096,10 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags;  /* prctl */
+	unsigned short			task_isolation_state;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..5a954302fdea 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,10 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 78cb2461012e..500c2aeb49b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -474,6 +474,34 @@ endmenu # "CPU/Task time and stats accounting"
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	 Allow userspace processes that place themselves on cores with
+	 nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	 to "isolate" themselves from the kernel.  Prior to returning to
+	 userspace, isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running in
+	 userspace.  Attempting to re-enter the kernel while in this mode
+	 will cause the task to be terminated with a signal; you must
+	 explicitly use prctl() to disable task isolation before resuming
+	 normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for tasks
+	 running hard real-time code in userspace, such as a high-speed
+	 network driver implemented in userspace.  Without this option,
+	 but with NO_HZ_FULL enabled, the kernel will make a best-effort,
+	 "soft" attempt to shield a single userspace process from
+	 interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index ed470aac53da..21c908c3083c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 9ad37b9e44a7..df0c7a07c11f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -156,6 +157,7 @@ void __context_tracking_exit(enum ctx_state state)
 			if (state == CONTEXT_USER) {
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..e2baa21af03a
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,402 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1,	/* In userspace, isolated */
+	STATE_WARNED = 2	/* Like ISOLATED but console warning issued */
+};
+
+cpumask_var_t task_isolation_map;
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		cpumask_and(task_isolation_map, cpu_isolated_map,
+			    tick_nohz_full_mask);
+
+	return 0;
+}
+core_initcall(task_isolation_init);
+
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/* Disable task isolation in the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	p->task_isolation_flags = 0;
+	p->task_isolation_state = STATE_NORMAL;
+	clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (cpumask_weight(&current->cpus_allowed) != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	current->task_isolation_state = STATE_ISOLATED;
+	return;
+
+error:
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg;
+
+		syscall_get_arguments(current, current_pt_regs(), 0, 1, &arg);
+		if (arg == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
+		task->comm, task->pid, smp_processor_id(), syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ * We don't warn if we are in STATE_WARNED in case a remote cpu already
+ * reported that it was going to interrupt us, so we don't generate
+ * a lot of confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+	bool do_warn;
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		pr_err("%s/%d (cpu %d): in NMI; not delivering signal\n",
+			task->comm, task->pid, smp_processor_id());
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (do_warn) {
+		pr_warn("%s/%d (cpu %d): task_isolation lost due to %s\n",
+			task->comm, task->pid, smp_processor_id(), buf);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	stop_isolation(task);
+	if (do_warn) {
+		pr_warn("%s/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->pid, task_cpu(task));
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Return a task_struct pointer (with ref count bumped up) for the
+ * specified cpu if the task running on that cpu at this moment is in
+ * isolation mode and hasn't yet been warned, otherwise NULL.
+ * In addition, toggle the task state to WARNED in anticipation of
+ * doing a printk, and send a reschedule IPI if needed.
+ */
+static struct task_struct *isolation_task(int cpu, int do_interrupt)
+{
+	struct task_struct *p = try_get_task_struct_on_cpu(cpu);
+
+	if (p == NULL)
+		return NULL;
+
+	if (p->task_isolation_state != STATE_ISOLATED)
+		goto bad_task;
+
+	/*
+	 * If we are claiming to be delivering a remote interrupt to our
+	 * own task, this has to be a bug, since here we are already in the
+	 * kernel, and somehow we didn't reset to STATE_NORMAL.
+	 */
+	if (WARN_ON_ONCE(p == current)) {
+		stop_isolation(p);
+		goto bad_task;
+	}
+
+	p->task_isolation_state = STATE_WARNED;
+	if (do_interrupt)
+		smp_send_reschedule(cpu);
+
+	return p;
+
+bad_task:
+	put_task_struct(p);
+	return NULL;
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void _task_isolation_remote(int cpu, bool do_interrupt, const char *fmt, ...)
+{
+	struct task_struct *p;
+	va_list args;
+	char buf[200];
+
+	if (!is_isolation_cpu(cpu))
+		return;
+
+	p = isolation_task(cpu, do_interrupt);
+	if (p == NULL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	pr_warn("%s/%d (cpu %d): task_isolation lost due to %s by %s/%d on cpu %d\n",
+		p->comm, p->pid, cpu, buf,
+		current->comm, current->pid, smp_processor_id());
+	put_task_struct(p);
+	debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void _task_isolation_remote_cpumask(const struct cpumask *mask,
+				    bool do_interrupt, const char *fmt, ...)
+{
+	struct task_struct *p = NULL;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu;
+
+	if (task_isolation_map == NULL ||
+	    !zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (p)
+			put_task_struct(p);
+		p = isolation_task(cpu, do_interrupt);
+		if (p)
+			cpumask_set_cpu(cpu, warn_mask);
+	}
+	if (p == NULL)
+		goto done;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	pr_warn("%s/%d %s %*pbl): task_isolation lost due to %s by %s/%d on cpu %d\n",
+		p->comm, p->pid,
+		cpumask_weight(warn_mask) == 1 ? "(cpu" : "etc (cpus",
+		cpumask_pr_args(warn_mask), buf,
+		current->comm, current->pid, smp_processor_id());
+	put_task_struct(p);
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 800a18f77732..fa4786050431 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -40,6 +40,7 @@
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
 #include <linux/posix-timers.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -658,6 +659,7 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index 9aebc2935013..91ed79605768 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2385,6 +2386,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2


* [PATCH v16 07/13] Add task isolation hooks to arch-independent code
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (5 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:04 ` [PATCH v16 08/13] arch/x86: enable task isolation functionality Chris Metcalf
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel
  Cc: Chris Metcalf

This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
  local task

- irq_work_queue_on() generates an isolation warning for the remote
  task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
  the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
  the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code, as here, when doing so
results in fewer changed lines of code, as for example is true of
the users of the arch_send_call_function_*() APIs.  Or, they can
be placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.
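
A sketch of that possible cleanup (illustrative only; the
arch_smp_send_reschedule() name is hypothetical and is not part of
this series) might look like:

	/* kernel/smp.c: hypothetical generic wrapper, not in this series. */
	void smp_send_reschedule(int cpu)
	{
		/* The generic hook now runs for every arch and caller. */
		task_isolation_remote(cpu, "reschedule IPI");
		arch_smp_send_reschedule(cpu);	/* arch-specific IPI delivery */
	}

Each architecture would then rename its implementation to
arch_smp_send_reschedule() and drop its own task_isolation_remote()
call.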

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 kernel/irq/irqdesc.c | 5 +++++
 kernel/irq_work.c    | 5 ++++-
 kernel/smp.c         | 6 +++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7ed369f..1b114c6b7ab8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -633,6 +634,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 		irq = irq_find_mapping(domain, hwirq);
 #endif
 
+	task_isolation_interrupt((irq == hwirq) ?
+				 "irq %d (%s)" : "irq %d (%s hwirq %d)",
+				 irq, domain ? domain->name : "", hwirq);
+
 	/*
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..cde49f1f31f7 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_remote(cpu, "irq_work");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85c8d41..44252aa650ac 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -175,8 +176,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_remote(cpu, "IPI function");
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -458,6 +461,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
-- 
2.1.2


* [PATCH v16 08/13] arch/x86: enable task isolation functionality
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (6 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 07/13] Add task isolation hooks to arch-independent code Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:04 ` [PATCH v16 09/13] arch/arm64: " Chris Metcalf
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, H. Peter Anvin,
	x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.

In syscall_trace_enter(), add the necessary support for
reporting syscalls for task-isolation processes.

Add task_isolation_interrupt() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults
and non-signalling MPX fixups.
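
For reference, here is a minimal userspace sketch of entering
isolation via the new prctl(), mirroring the selftest later in this
series; it assumes the task is already affinitized to a
task-isolation cpu, and error handling is mostly elided:

	#include <sys/prctl.h>
	#include <errno.h>

	#ifndef PR_TASK_ISOLATION		/* not in system headers yet */
	# define PR_TASK_ISOLATION		48
	# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
	#endif

	/* Try to enter task isolation; returns 0 on success. */
	static int enable_isolation(void)
	{
		/* arg3/4/5 must be zero or the kernel returns -EINVAL. */
		while (prctl(PR_TASK_ISOLATION,
			     PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
			if (errno != EAGAIN)
				return -1;	/* hard failure, e.g. EINVAL */
			/* EAGAIN: the kernel is not yet quiesced; retry. */
		}
		return 0;
	}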

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 14 ++++++++++++++
 arch/x86/include/asm/apic.h        |  3 +++
 arch/x86/include/asm/thread_info.h |  8 +++++---
 arch/x86/kernel/smp.c              |  2 ++
 arch/x86/kernel/traps.c            |  3 +++
 arch/x86/mm/fault.c                |  5 +++++
 7 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..45967840b81a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,6 +114,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03505ffbe1b6..2c70b915d1f2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -24,6 +24,7 @@
 #include <linux/uprobes.h>
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -87,6 +88,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	if (emulated)
 		return -1L;
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->orig_ax) == -1)
+			return -1L;
+		work &= ~_TIF_TASK_ISOLATION;
+	}
+
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Do seccomp after ptrace, to catch any tracer changes.
@@ -196,6 +207,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 		exit_to_usermode_loop(regs, cached_flags);
 
+	if (cached_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 5f01671c68f2..c70cb9cacfc0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -2,6 +2,7 @@
 #define _ASM_X86_APIC_H
 
 #include <linux/cpumask.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
@@ -618,6 +619,7 @@ extern void irq_exit(void);
 
 static inline void entering_irq(void)
 {
+	task_isolation_interrupt("irq");
 	irq_enter();
 }
 
@@ -629,6 +631,7 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
+	task_isolation_interrupt("ack irq");
 	irq_enter();
 	ack_APIC_irq();
 }
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 89e7eeb5cec1..aa9d9d817f8b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -85,6 +85,7 @@ struct thread_info {
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_PATCH_PENDING	13	/* pending live patching update */
+#define TIF_TASK_ISOLATION	14	/* task isolation enabled for task */
 #define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
@@ -111,6 +112,7 @@ struct thread_info {
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOCPUID		(1 << TIF_NOCPUID)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
@@ -132,15 +134,15 @@ struct thread_info {
 #define _TIF_WORK_SYSCALL_ENTRY	\
 	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |	\
 	 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	 _TIF_NOHZ | _TIF_TASK_ISOLATION)
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
 	(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |	\
 	 _TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU |	\
 	 _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |	\
-	 _TIF_PATCH_PENDING | _TIF_NOHZ | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_FSCHECK)
+	 _TIF_PATCH_PENDING | _TIF_TASK_ISOLATION | _TIF_NOHZ |		\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_FSCHECK)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 5c574dff4c1a..73bc58029292 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/isolation.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -128,6 +129,7 @@ static void native_smp_send_reschedule(int cpu)
 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
 		return;
 	}
+	task_isolation_remote(cpu, "reschedule IPI");
 	apic->send_IPI(cpu, RESCHEDULE_VECTOR);
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 67db4f43309e..a48ebfc7b98c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -37,6 +37,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #if defined(CONFIG_EDAC)
 #include <linux/edac.h>
@@ -467,6 +468,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		/* No signal was generated, but notify task-isolation tasks. */
+		task_isolation_interrupt("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa053a5..bd479142e3ca 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -15,6 +15,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_interrupt	*/
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1486,6 +1487,10 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & PF_USER)
+		task_isolation_interrupt("page fault at %#lx", address);
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
-- 
2.1.2


* [PATCH v16 09/13] arch/arm64: enable task isolation functionality
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (7 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 08/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:32   ` Mark Rutland
  2017-11-03 17:04 ` [PATCH v16 10/13] arch/arm: " Chris Metcalf
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to carry the "flags"
value from current_thread_info()->flags for each of the tests,
rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, we report page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/thread_info.h |  5 ++++-
 arch/arm64/kernel/ptrace.c           | 18 +++++++++++++++---
 arch/arm64/kernel/signal.c           | 10 +++++++++-
 arch/arm64/kernel/smp.c              |  7 +++++++
 arch/arm64/mm/fault.c                |  5 +++++
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..d77ecdb29765 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -73,6 +73,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_VMAP_STACK
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index ddded6497a8a..9c749eca7384 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,7 @@ void arch_setup_new_exec(void);
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION	6	/* task isolation active */
 #define TIF_NOHZ		7
 #define TIF_SYSCALL_TRACE	8
 #define TIF_SYSCALL_AUDIT	9
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
@@ -108,7 +110,8 @@ void arch_setup_new_exec(void);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 9cbb6123208f..e5c0d7cdaf4e 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -38,6 +38,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1371,14 +1372,25 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	unsigned long work = READ_ONCE(current_thread_info()->flags);
+
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	/* Do the secure computing after ptrace; failures should be fast. */
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return -1;
+	}
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing(NULL) == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 0bdc96c61bc0..d8f4904e992f 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -30,6 +30,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -741,6 +742,10 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS				     \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME | \
+	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned int thread_flags)
 {
@@ -777,5 +782,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_irq_disable();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 9f7195a5773e..4159c40de3b4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -40,6 +40,7 @@
 #include <linux/of.h>
 #include <linux/irq_work.h>
 #include <linux/kexec.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 #endif
@@ -879,6 +881,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
 	}
 
+	task_isolation_interrupt("IPI type %d (%s)", ipinr,
+				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
 	switch (ipinr) {
 	case IPI_RESCHEDULE:
 		scheduler_ipi();
@@ -941,12 +946,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b64958b23a7f..bff2f84d5f4e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -32,6 +32,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/bug.h>
 #include <asm/cmpxchg.h>
@@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
+
 		/*
 		 * Major/minor page fault accounting is only done
 		 * once. If we go through a retry, it is extremely
-- 
2.1.2


* [PATCH v16 10/13] arch/arm: enable task isolation functionality
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (8 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 09/13] arch/arm64: " Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:23   ` Russell King - ARM Linux
  2018-03-18 14:48   ` Yury Norov
  2017-11-03 17:04 ` [PATCH v16 11/13] arch/tile: " Chris Metcalf
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	Russell King, linux-arm-kernel, linux-kernel
  Cc: Francis Giraldeau, Chris Metcalf

From: Francis Giraldeau <francis.giraldeau@gmail.com>

This patch ports the task isolation functionality to the 32-bit arm
architecture. Task isolation needs an additional thread flag, which in
turn requires changing the entry assembly code to accept a flags mask
wider than one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK
are therefore now loaded from the literal pool. The rest of the patch is
straightforward and reflects what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.
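
(The underlying issue is that an ARM data-processing immediate can only
encode a limited set of constants, roughly an 8-bit value rotated by an
even amount, so once the combined flag mask spans more than one byte it
can no longer be a "tst" immediate and must come from the literal pool:

	tst	r1, #0xff			@ encodable immediate
	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
	tst	r1, r2				@ wide mask via literal pool

hence the ldr/tst sequences in entry-common.S below.)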

Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [with modifications]
---
 arch/arm/Kconfig                   |  1 +
 arch/arm/include/asm/thread_info.h | 10 +++++++---
 arch/arm/kernel/entry-common.S     | 12 ++++++++----
 arch/arm/kernel/ptrace.c           | 10 ++++++++++
 arch/arm/kernel/signal.c           | 10 +++++++++-
 arch/arm/kernel/smp.c              |  4 ++++
 arch/arm/mm/fault.c                |  8 +++++++-
 7 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7888c9803eb0..3423c655a32b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,6 +48,7 @@ config ARM
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
 	select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..a7b76ac9543d 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -142,7 +142,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define TIF_SYSCALL_TRACE	4	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	5	/* syscall auditing active */
 #define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_SECCOMP		7	/* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION	7	/* task isolation active */
+#define TIF_SECCOMP		8	/* seccomp syscall filtering active */
 
 #define TIF_NOHZ		12	/* in adaptive nohz mode */
 #define TIF_USING_IWMMXT	17
@@ -156,18 +157,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USING_IWMMXT	(1 << TIF_USING_IWMMXT)
 
 /* Checks for any syscall work in entry-common.S */
 #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
-			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+			   _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+			   _TIF_TASK_ISOLATION)
 
 /*
  * Change these and you break ASM code in entry-common.S
  */
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
-				 _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+				 _TIF_TASK_ISOLATION)
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 99c908226065..9ae3ef2dbc1e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	bne	fast_work_pending
 
 
@@ -83,7 +84,8 @@ ret_fast_syscall:
 	cmp	r2, #TASK_SIZE
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
-	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	ldr	r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+	tst	r1, r2
 	beq	no_work_pending
  UNWIND(.fnend		)
 ENDPROC(ret_fast_syscall)
@@ -91,7 +93,8 @@ ENDPROC(ret_fast_syscall)
 	/* Slower path - fall through to work_pending */
 #endif
 
-	tst	r1, #_TIF_SYSCALL_WORK
+	ldr	r2, =_TIF_SYSCALL_WORK
+	tst	r1, r2
 	bne	__sys_trace_return_nosave
 slow_work_pending:
 	mov	r0, sp				@ 'regs'
@@ -238,7 +241,8 @@ local_restart:
 	ldr	r10, [tsk, #TI_FLAGS]		@ check for syscall tracing
 	stmdb	sp!, {r4, r5}			@ push fifth and sixth args
 
-	tst	r10, #_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	ldr	r11, =_TIF_SYSCALL_WORK		@ are we tracing syscalls?
+	tst	r10, r11
 	bne	__sys_trace
 
 	cmp	scno, #NR_syscalls		@ check upper syscall limit
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 58e3771e4c5b..0cfcba5a93df 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -27,6 +27,7 @@
 #include <linux/audit.h>
 #include <linux/tracehook.h>
 #include <linux/unistd.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/traps.h>
@@ -936,6 +937,15 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(scno) == -1)
+			return -1;
+	}
+
 	/* Do seccomp after ptrace; syscall may have changed. */
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 	if (secure_computing(NULL) == -1)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index b67ae12503f3..7c526efb301a 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -15,6 +15,7 @@
 #include <linux/tracehook.h>
 #include <linux/uprobes.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/elf.h>
 #include <asm/cacheflush.h>
@@ -605,6 +606,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
 	return 0;
 }
 
+#define WORK_PENDING_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
 asmlinkage int
 do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 {
@@ -641,7 +645,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 		}
 		local_irq_disable();
 		thread_flags = current_thread_info()->flags;
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	return 0;
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index c9a0a5299827..76f8b2010ddf 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <linux/atomic.h>
 #include <asm/smp.h>
@@ -525,6 +526,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 
 void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "wakeup IPI");
 	smp_cross_call(mask, IPI_WAKEUP);
 }
 
@@ -544,6 +546,7 @@ void arch_irq_work_raise(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {
+	task_isolation_remote_cpumask(mask, "timer IPI");
 	smp_cross_call(mask, IPI_TIMER);
 }
 #endif
@@ -665,6 +668,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
 
 void smp_send_reschedule(int cpu)
 {
+	task_isolation_remote(cpu, "reschedule IPI");
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 42f585379e19..052860948771 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -20,6 +20,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/exception.h>
 #include <asm/pgtable.h>
@@ -352,8 +353,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (user_mode(regs))
+			task_isolation_interrupt("page fault at %#lx", addr);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.1.2


* [PATCH v16 11/13] arch/tile: enable task isolation functionality
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (9 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 10/13] arch/arm: " Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:04 ` [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_start() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_interrupt() in places where exceptions
may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 arch/tile/Kconfig                   |  1 +
 arch/tile/include/asm/thread_info.h |  2 ++
 arch/tile/kernel/hardwall.c         |  2 ++
 arch/tile/kernel/irq.c              |  3 +++
 arch/tile/kernel/messaging.c        |  4 ++++
 arch/tile/kernel/process.c          |  4 ++++
 arch/tile/kernel/ptrace.c           | 10 ++++++++++
 arch/tile/kernel/single_step.c      |  7 +++++++
 arch/tile/kernel/smp.c              | 21 +++++++++++----------
 arch/tile/kernel/time.c             |  2 ++
 arch/tile/kernel/unaligned.c        |  4 ++++
 arch/tile/mm/fault.c                | 13 ++++++++++++-
 arch/tile/mm/homecache.c            | 11 +++++++++++
 13 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4583c0320059..2d644138f2eb 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -16,6 +16,7 @@ config TILE
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..3e298bd43d11 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
 #define TIF_SYSCALL_TRACEPOINT	9	/* syscall tracepoint instrumentation */
 #define TIF_POLLING_NRFLAG	10	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_NOHZ		11	/* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION	12	/* in task isolation mode */
 
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
@@ -139,6 +140,7 @@ extern void _cpu_idle(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION	(1<<TIF_TASK_ISOLATION)
 
 /* Work to do as we loop to exit to user space. */
 #define _TIF_WORK_MASK \
diff --git a/arch/tile/kernel/hardwall.c b/arch/tile/kernel/hardwall.c
index 2fd1694ac1d0..9559f04d1c2a 100644
--- a/arch/tile/kernel/hardwall.c
+++ b/arch/tile/kernel/hardwall.c
@@ -23,6 +23,7 @@
 #include <linux/smp.h>
 #include <linux/cdev.h>
 #include <linux/compat.h>
+#include <linux/isolation.h>
 #include <asm/hardwall.h>
 #include <asm/traps.h>
 #include <asm/siginfo.h>
@@ -328,6 +329,7 @@ void __kprobes do_hardwall_trap(struct pt_regs* regs, int fault_num)
 	int found_processes;
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	task_isolation_interrupt("hardwall trap");
 	irq_enter();
 
 	/* Figure out which network trapped. */
diff --git a/arch/tile/kernel/irq.c b/arch/tile/kernel/irq.c
index 22044fc691ef..0b1b24b9c496 100644
--- a/arch/tile/kernel/irq.c
+++ b/arch/tile/kernel/irq.c
@@ -18,6 +18,7 @@
 #include <linux/irq.h>
 #include <linux/kernel_stat.h>
 #include <linux/uaccess.h>
+#include <linux/isolation.h>
 #include <hv/drv_pcie_rc_intf.h>
 #include <arch/spr_def.h>
 #include <asm/traps.h>
@@ -100,6 +101,8 @@ void tile_dev_intr(struct pt_regs *regs, int intnum)
 
 	/* Track time spent here in an interrupt context. */
 	old_regs = set_irq_regs(regs);
+
+	task_isolation_interrupt("IPI: IRQ mask %#lx", remaining_irqs);
 	irq_enter();
 
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
diff --git a/arch/tile/kernel/messaging.c b/arch/tile/kernel/messaging.c
index 7475af3aacec..1cf1630215f0 100644
--- a/arch/tile/kernel/messaging.c
+++ b/arch/tile/kernel/messaging.c
@@ -16,6 +16,7 @@
 #include <linux/smp.h>
 #include <linux/hardirq.h>
 #include <linux/ptrace.h>
+#include <linux/isolation.h>
 #include <asm/hv_driver.h>
 #include <asm/irq_regs.h>
 #include <asm/traps.h>
@@ -86,6 +87,7 @@ void hv_message_intr(struct pt_regs *regs, int intnum)
 
 			tag = message[0];
 #ifdef CONFIG_SMP
+			task_isolation_interrupt("SMP message %d", tag);
 			evaluate_message(message[0]);
 #else
 			panic("Received IPI message %d in UP mode", tag);
@@ -94,6 +96,8 @@ void hv_message_intr(struct pt_regs *regs, int intnum)
 			HV_IntrMsg *him = (HV_IntrMsg *)message;
 			struct hv_driver_cb *cb =
 				(struct hv_driver_cb *)him->intarg;
+			task_isolation_interrupt("interrupt message %#lx(%#lx)",
+					   him->intarg, him->intdata);
 			cb->callback(cb, him->intdata);
 			__this_cpu_inc(irq_stat.irq_hv_msg_count);
 		}
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index f0a0e18e4dfb..ac22e971dc1d 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -32,6 +32,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -516,6 +517,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 #endif
 	}
 
+	if (thread_info_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
+
 	user_enter();
 }
 
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index e1a078e6828e..908d57d3d2cf 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -24,6 +24,7 @@
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
 #include <linux/sched/task_stack.h>
+#include <linux/isolation.h>
 
 #include <asm/traps.h>
 #include <arch/chip.h>
@@ -263,6 +264,15 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 		return -1;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (work & _TIF_TASK_ISOLATION) {
+		if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+			return -1;
+	}
+
 	if (secure_computing(NULL) == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index de3eae813e52..4d5d4897e70c 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <linux/uaccess.h>
@@ -320,6 +321,9 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +771,9 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 94a62e1197ce..6b100faa33ef 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -258,10 +259,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +272,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_remote(cpu, "reschedule IPI");
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 6643ffbc0615..f74f10d827fa 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -25,6 +25,7 @@
 #include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/timekeeper_internal.h>
+#include <linux/isolation.h>
 #include <asm/irq_regs.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
@@ -196,6 +197,7 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
 	arch_local_irq_mask(INT_TILE_TIMER);
 
 	/* Track time spent here in an interrupt context */
+	task_isolation_interrupt("timer");
 	irq_enter();
 
 	/* Track interrupt count. */
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 8149c38f67b6..ee14a61d0eee 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -27,6 +27,7 @@
 #include <linux/extable.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <linux/uaccess.h>
@@ -1547,6 +1548,9 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	/* No signal was generated, but notify task-isolation tasks. */
+	task_isolation_interrupt("unaligned access at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f58fa06a2214..0a7a731d7413 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -38,6 +38,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -311,8 +312,13 @@ static int handle_page_fault(struct pt_regs *regs,
 	 */
 	pgd = get_current_pgd();
 	if (handle_migrating_pte(pgd, fault_num, address, regs->pc,
-				 is_kernel_mode, write))
+				 is_kernel_mode, write)) {
+		/* No signal was generated, but notify task-isolation tasks. */
+		if (!is_kernel_mode)
+			task_isolation_interrupt("migration fault at %#lx",
+						       address);
 		return 1;
+	}
 
 	si_code = SEGV_MAPERR;
 
@@ -482,6 +488,11 @@ static int handle_page_fault(struct pt_regs *regs,
 #endif
 
 	up_read(&mm->mmap_sem);
+
+	/* No signal was generated, but notify task-isolation tasks. */
+	if (flags & FAULT_FLAG_USER)
+		task_isolation_interrupt("page fault at %#lx", address);
+
 	return 1;
 
 /*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index b51cc28acd0a..82bcffb26ec4 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -85,6 +86,16 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 */
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+
+	/*
+	 * Send an extra Linux-level interrupt here to task isolation tasks
+	 * so that they can generate the appropriate signal to kill
+	 * themselves.  If we didn't send a Linux interrupt, we would
+	 * interrupt the task into the hypervisor but it wouldn't get a
+	 * signal to advise it of the loss of the cpu.
+	 */
+	task_isolation_remote_cpumask_interrupt(&mask,
+						"remote cache/TLB flush");
 }
 
 /*
-- 
2.1.2


* [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (10 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 11/13] arch/tile: " Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-03 17:18   ` Mark Rutland
  2017-11-03 17:04 ` [PATCH v16 13/13] task_isolation self test Chris Metcalf
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Mark Rutland, Marc Zyngier, linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

When the schedule tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.
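
(Simplified, the relevant logic in __clockevents_switch_state() is
roughly:

	case CLOCK_EVT_STATE_ONESHOT_STOPPED:
		if (!dev->set_state_oneshot_stopped)
			return -ENOSYS;	/* hardware timer left armed */
		return dev->set_state_oneshot_stopped(dev);

so without the callback the state change fails and the timer keeps
ticking.)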

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

For regular applications this fix merely avoids a small performance
hiccup, but for TASK_ISOLATION tasks it prevents a potentially serious
spurious kernel timer interrupt to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index f74f10d827fa..afca6fe496c8 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -163,6 +163,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index fd4b7f684bd0..61ea7f907c56 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.1.2


* [PATCH v16 13/13] task_isolation self test
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (11 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2017-11-06 15:38 ` [PATCH v16 00/13] support "task_isolation" mode Christopher Lameter
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Shuah Khan,
	linux-kselftest, linux-kernel
  Cc: Chris Metcalf

This code tests various aspects of task_isolation.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |   6 +
 tools/testing/selftests/task_isolation/config      |   1 +
 tools/testing/selftests/task_isolation/isolation.c | 643 +++++++++++++++++++++
 4 files changed, 651 insertions(+)
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff805643b5f7..ab781b99d3c9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -30,6 +30,7 @@ TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
 TARGETS += sysctl
+TARGETS += task_isolation
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/task_isolation/Makefile b/tools/testing/selftests/task_isolation/Makefile
new file mode 100644
index 000000000000..74d060b493f9
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -O2 -g -W -Wall
+LDFLAGS += -pthread
+
+TEST_GEN_PROGS := isolation
+
+include ../lib.mk
diff --git a/tools/testing/selftests/task_isolation/config b/tools/testing/selftests/task_isolation/config
new file mode 100644
index 000000000000..34edfbca0423
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/config
@@ -0,0 +1 @@
+CONFIG_TASK_ISOLATION=y
diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
new file mode 100644
index 000000000000..9c0b49619b40
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -0,0 +1,643 @@
+/*
+ * This test program tests the features of task isolation.
+ *
+ * - Makes sure enabling task isolation fails if you are unaffinitized
+ *   or on a non-task-isolation cpu.
+ *
+ * - Validates that various synchronous exceptions are fatal in isolation
+ *   mode:
+ *
+ *   * Page fault
+ *   * System call
+ *   * TLB invalidation from another thread [1]
+ *   * Unaligned access [2]
+ *
+ * - Tests that taking a user-defined signal for the above faults works.
+ *
+ * - Tests that you can prctl(PR_TASK_ISOLATION, 0) to turn isolation off.
+ *
+ * - Tests that receiving a signal turns isolation off.
+ *
+ * - Tests that having another process schedule into the core where the
+ *   isolation process is running correctly kills the isolation process.
+ *
+ * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64
+ * [2] Unaligned access only causes exceptions on some platforms, e.g. tile
+ *
+ *
+ * You must be running under a kernel configured with CONFIG_TASK_ISOLATION.
+ *
+ * You must have booted with e.g. "nohz_full=1-15 isolcpus=1-15" to
+ * enable some task-isolation cores.  If you get interrupt reports, you
+ * can also add the boot argument "task_isolation_debug" to learn more.
+ * If you get jitter but no reports, define DEBUG_TASK_ISOLATION to add
+ * isolation checks in every user_exit() call.
+ *
+ * NOTE: you must disable the code in tick_nohz_stop_sched_tick()
+ * that limits the tick delta to the maximum scheduler deferment
+ * by making it conditional not just on "!ts->inidle" but also
+ * on !current->task_isolation_flags.  This is around line 756
+ * in kernel/time/tick-sched.c (as of kernel 4.14).
+ *
+ *
+ * To compile the test program, run "make".
+ *
+ * Run the program as "./isolation" and if you want to run the
+ * jitter-detection loop for longer than 10 giga-cycles, specify the
+ * number of giga-cycles to run it for as a command-line argument.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <string.h>
+#include <errno.h>
+#include <sched.h>
+#include <pthread.h>
+#include <sys/wait.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
+
+#ifndef PR_TASK_ISOLATION   /* Not in system headers yet? */
+# define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+#endif
+
+/* The cpu we are using for isolation tests. */
+static int task_isolation_cpu;
+
+/* Overall status, maintained as tests run. */
+static int exit_status = KSFT_PASS;
+
+/* Data shared between parent and children. */
+static struct {
+	/* Set to true when the parent's isolation prctl is successful. */
+	bool parent_isolated;
+} *shared;
+
+/* Set affinity to a single cpu or die if trying to do so fails. */
+static void set_my_cpu(int cpu)
+{
+	cpu_set_t set;
+	int rc;
+
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	rc = sched_setaffinity(0, sizeof(cpu_set_t), &set);
+	assert(rc == 0);
+}
+
+#define timeout_init(tv) gettimeofday((tv), NULL)
+
+static int timeout(struct timeval *start, double secs)
+{
+	struct timeval tv;
+	double time;
+
+	gettimeofday(&tv, NULL);
+	time = (tv.tv_sec - start->tv_sec) +
+		(tv.tv_usec - start->tv_usec) / 1000000.0;
+	return time >= secs;
+}
+
+static inline int prctl_isolation(int flags)
+{
+	return prctl(PR_TASK_ISOLATION, flags, 0, 0, 0);
+}
+
+static void set_task_isolation(int flags)
+{
+	int rc;
+	struct timeval start;
+
+	/* Wait for up to a second for the kernel to isolate this core. */
+	timeout_init(&start);
+	do {
+		rc = prctl_isolation(flags);
+		if (rc == 0 || errno != EAGAIN)
+			break;
+	} while (!timeout(&start, 1));
+	if (rc != 0) {
+		prctl_isolation(0);
+		printf("couldn't enable isolation (%d): FAIL\n", errno);
+		ksft_exit_fail();
+	}
+}
+
+/*
+ * Run a child process in task isolation mode and report its status.
+ * The child does mlockall() and moves itself to the task isolation cpu.
+ * It then runs SETUP_FUNC (if specified), calls prctl(PR_TASK_ISOLATION)
+ * with FLAGS (if non-zero), and then invokes TEST_FUNC and exits
+ * with its status.
+ */
+static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
+{
+	int pid, rc, status;
+
+	fflush(stdout);
+	pid = fork();
+	assert(pid >= 0);
+	if (pid != 0) {
+		/* In parent; wait for child and return its status. */
+		waitpid(pid, &status, 0);
+		return status;
+	}
+
+	/* In child. */
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	set_my_cpu(task_isolation_cpu);
+	if (setup_func)
+		setup_func();
+	if (flags)
+		set_task_isolation(flags);
+	rc = test_func();
+	exit(rc);
+}
+
+/* Run a test and make sure it exits with success. */
+static void test_ok(const char *testname, void (*setup_func)(),
+		    int (*test_func)())
+{
+	int status;
+
+	status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (status == KSFT_PASS) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/*
+ * Run a test and ensure it is killed with SIGKILL by default,
+ * for whatever misdemeanor is committed in TEST_FUNC.
+ * Also test it with SIGUSR1 as well to make sure that works.
+ */
+static void test_killed(const char *testname, void (*setup_func)(),
+			int (*test_func)())
+{
+	int status;
+
+	status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
+		printf("%s: OK\n", testname);
+	} else {
+		printf("%s: FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+
+	status = run_test(setup_func, test_func,
+			  PR_TASK_ISOLATION_ENABLE |
+			  PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
+		printf("%s (SIGUSR1): OK\n", testname);
+	} else {
+		printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
+		exit_status = KSFT_FAIL;
+	}
+}
+
+/* Mapping address passed from setup function to test function. */
+static char *fault_file_mapping;
+
+/* mmap() in a file so we can test touching a not-yet-faulted page. */
+static void setup_fault(void)
+{
+	char fault_file[] = "/tmp/isolation_XXXXXX";
+	int fd, rc;
+
+	fd = mkstemp(fault_file);
+	assert(fd >= 0);
+	rc = ftruncate(fd, getpagesize());
+	assert(rc == 0);
+	fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+				  MAP_SHARED, fd, 0);
+	assert(fault_file_mapping != MAP_FAILED);
+	close(fd);
+	unlink(fault_file);
+}
+
+/* Now touch the not-yet-faulted page (and be killed). */
+static int do_fault(void)
+{
+	*fault_file_mapping = 1;
+	return KSFT_FAIL;
+}
+
+/* Make a syscall (and be killed). */
+static int do_syscall(void)
+{
+	static const char *msg = "goodbye, world\n";
+	int rc;
+
+	rc = write(STDOUT_FILENO, msg, strlen(msg));
+	assert(rc == (int)strlen(msg));
+	return KSFT_FAIL;
+}
+
+/* Turn isolation back off and don't be killed. */
+static int do_syscall_off(void)
+{
+	const char *msg = "==> hello, world\n";
+	int rc;
+
+	prctl_isolation(0);
+	rc = write(STDOUT_FILENO, msg, strlen(msg));
+	assert(rc == (int)strlen(msg));
+	return KSFT_PASS;
+}
+
+/* Test that isolation is off in signal handlers. */
+static void segv_handler(int sig)
+{
+	printf("Received signal %d successfully\n", sig);
+	exit(0);
+}
+
+static void setup_segv(void)
+{
+	signal(SIGSEGV, segv_handler);
+}
+
+static int do_segv(void)
+{
+	*(volatile int *)0 = 0;
+	return KSFT_FAIL;   /* should not get here */
+}
+
+/* ARM64 uses tlbi instructions so doesn't need to interrupt the remote core. */
+#ifndef __aarch64__
+#define TEST_MUNMAP
+
+/*
+ * Fork a thread that will munmap() after a short while.
+ * It will deliver a TLB flush to the task isolation core.
+ */
+
+static void *start_munmap(void *p)
+{
+	usleep(500000);   /* 0.5s */
+	munmap(p, getpagesize());
+	return 0;
+}
+
+static void setup_munmap(void)
+{
+	pthread_t thr;
+	void *p;
+	int rc;
+
+	/* First, go back to cpu 0 and allocate some memory. */
+	set_my_cpu(0);
+	p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		 MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
+	assert(p != MAP_FAILED);
+
+	/*
+	 * Now fire up a thread that will wait half a second on cpu 0
+	 * and then munmap the mapping.
+	 */
+	rc = pthread_create(&thr, NULL, start_munmap, p);
+	assert(rc == 0);
+
+	/* Back to the task-isolation cpu. */
+	set_my_cpu(task_isolation_cpu);
+}
+
+/* Global variable to avoid the compiler outsmarting us. */
+int munmap_spin;
+
+static int do_munmap(void)
+{
+	while (munmap_spin < 1000000000)
+		WRITE_ONCE(munmap_spin, munmap_spin + 1);
+	return KSFT_FAIL;
+}
+#endif
+
+#ifdef __tilegx__
+#define TEST_UNALIGNED
+
+/*
+ * Make an unaligned access (and be killed).
+ * Only for tilegx, since other platforms don't do in-kernel fixups.
+ */
+static int do_unaligned(void)
+{
+	static int buf[2];
+	int *addr = (int *)((char *)buf + 1);
+
+	READ_ONCE(*addr);
+
+	asm("nop");
+	return KSFT_FAIL;
+}
+#endif
+
+/*
+ * Test to make sure that if a process is scheduled to run on the same core
+ * as a task isolation process, the task isolation process will be signalled.
+ */
+
+static void setup_schedule(void)
+{
+	struct timeval start;
+	int rc, child_pid;
+
+	/*
+	 * First, go back to cpu 0 to ensure that the child we create here
+	 * doesn't race with task isolation by the parent in do_schedule().
+	 */
+	set_my_cpu(0);
+
+	/* Fork and fault in all memory in both. */
+	child_pid = fork();
+	assert(child_pid >= 0);
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+	if (child_pid != 0) {
+		/* Send parent back to the task isolation cpu. */
+		set_my_cpu(task_isolation_cpu);
+		return;
+	}
+
+	/*
+	 * In child.  Wait until parent notifies us that it has completed
+	 * its prctl, then reschedule onto its cpu.
+	 */
+	timeout_init(&start);
+	while (!shared->parent_isolated) {
+		if (timeout(&start, 1)) {
+			printf("child: no parent post-prctl\n");
+			exit(1);
+		}
+	}
+	set_my_cpu(task_isolation_cpu);
+
+	/*
+	 * We are now running on the task isolation cpu, which should have
+	 * killed the parent by forcing the OS to run the scheduler.  No
+	 * need to run any code beyond just invoking exit().
+	 */
+	exit(0);
+}
+
+static int do_schedule(void)
+{
+	struct timeval start;
+
+	/* Notify the child to switch to our cpu. */
+	shared->parent_isolated = true;
+
+	/*
+	 * Wait for child to come disturb us.  Note that we require
+	 * gettimeofday() to run via vDSO for this waiting idiom to work;
+	 * if it were a syscall we would get signalled just for calling it.
+	 */
+	timeout_init(&start);
+	while (!timeout(&start, 5))
+		;
+	printf("parent: no interrupt from scheduler\n");
+	exit(1);
+}
+
+#ifdef __tile__
+#include <arch/spr_def.h>
+#endif
+
+static inline u_int64_t get_cycle_count(void)
+{
+#ifdef __x86_64__
+	unsigned int lower, upper;
+
+	asm volatile("rdtsc" : "=a"(lower), "=d"(upper));
+	return lower | ((unsigned long)upper << 32);
+#elif defined(__tile__)
+	return __insn_mfspr(SPR_CYCLE);
+#elif defined(__aarch64__)
+	unsigned long vtick;
+
+	asm volatile("mrs %0, cntvct_el0" : "=r" (vtick));
+	return vtick;
+#elif defined(__arm__)
+	u_int64_t cval;
+
+	asm volatile("mrrc p15, 1, %Q0, %R0, c14" : "=r" (cval));
+	return cval;
+#else
+#error Unsupported architecture
+#endif
+}
+
+/*
+ * Histogram of cycle counts up to HISTSIZE cycles.  Any loop that takes
+ * this many cycles or more is considered an error, since we likely
+ * entered the kernel (rather than just, say, having to refetch cache
+ * lines or equivalent).
+ */
+#define HISTSIZE 600
+static long hist[HISTSIZE];
+
+/* Information on loss of control of the cpu (HISTSIZE cycles or more). */
+struct jitter_info {
+	unsigned long at;      /* cycle of jitter event */
+	long cycles;           /* how long we lost the cpu for */
+};
+#define MAX_EVENTS 100
+static struct jitter_info jitter[MAX_EVENTS];
+static unsigned int count;            /* index into jitter[] */
+
+static void jitter_summarize(void)
+{
+	unsigned int i;
+
+	printf("INFO: loop times:\n");
+	for (i = 0; i < HISTSIZE; ++i)
+		if (hist[i])
+			printf("  %d cycles (count: %ld)\n", i, hist[i]);
+
+	if (count)
+		printf("ERROR: jitter:\n");
+	for (i = 0; i < count; ++i)
+		printf("  %ld: %ld cycles\n", jitter[i].at, jitter[i].cycles);
+	if (count == ARRAY_SIZE(jitter))
+		printf("  ... more\n");
+}
+
+/*
+ * A DEBUG_TASK_ISOLATION kernel will issue a SIGUSR1 right after this
+ * variable is set, during the prctl_isolation(0).  Handle that case
+ * by checking the variable and basically ignoring the signal.
+ */
+static bool jitter_test_complete;
+
+static void jitter_handler(int sig)
+{
+	if (jitter_test_complete)
+		return;
+
+	if (sig == SIGUSR1) {
+		exit_status = KSFT_FAIL;
+		printf("ERROR: Program unexpectedly entered kernel.\n");
+	}
+	jitter_summarize();
+	exit(exit_status);
+}
+
+static void test_jitter(unsigned long waitticks)
+{
+	u_int64_t start, last, elapsed;
+	int rc;
+
+	printf("testing task isolation jitter for %ld ticks\n", waitticks);
+
+	signal(SIGINT, jitter_handler);
+	signal(SIGUSR1, jitter_handler);
+	set_my_cpu(task_isolation_cpu);
+	rc = mlockall(MCL_CURRENT);
+	assert(rc == 0);
+
+	set_task_isolation(PR_TASK_ISOLATION_ENABLE |
+			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+
+	last = start = get_cycle_count();
+	do {
+		u_int64_t next = get_cycle_count();
+		u_int64_t delta = next - last;
+
+		elapsed = next - start;
+		if (__builtin_expect(delta >= HISTSIZE, 0)) {
+			exit_status = KSFT_FAIL;
+			if (count < ARRAY_SIZE(jitter)) {
+				jitter[count].cycles = delta;
+				jitter[count].at = elapsed;
+				WRITE_ONCE(count, count + 1);
+			}
+		} else {
+			hist[delta]++;
+		}
+		last = next;
+
+	} while (elapsed < waitticks);
+
+	jitter_test_complete = true;
+	prctl_isolation(0);
+	jitter_summarize();
+}
+
+int main(int argc, char **argv)
+{
+	/* How many billion ticks to wait after running the other tests? */
+	unsigned long waitticks;
+	char buf[100];
+	char *result, *end;
+	FILE *f;
+
+	if (argc == 1)
+		waitticks = 10;
+	else if (argc == 2)
+		waitticks = strtol(argv[1], NULL, 10);
+	else {
+		printf("syntax: isolation [gigaticks]\n");
+		ksft_exit_fail();
+	}
+	waitticks *= 1000000000;
+
+	/* Get a core from the /sys nohz_full device. */
+	f = fopen("/sys/devices/system/cpu/nohz_full", "r");
+	if (f == NULL)
+		ksft_exit_skip("/sys nohz_full: SKIP (%s)\n", strerror(errno));
+	result = fgets(buf, sizeof(buf), f);
+	assert(result == buf);
+	fclose(f);
+	if (*buf == '\n')
+		ksft_exit_skip("No nohz_full cores configured.\n");
+	task_isolation_cpu = strtol(buf, &end, 10);
+	assert(end != buf);
+	assert(*end == ',' || *end == '-' || *end == '\n');
+	assert(task_isolation_cpu >= 0);
+
+	/* Make sure it matches the first core from the /sys isolated device. */
+	f = fopen("/sys/devices/system/cpu/isolated", "r");
+	if (f == NULL)
+		ksft_exit_skip("/sys isolated: SKIP (%s)\n", strerror(errno));
+	result = fgets(buf, sizeof(buf), f);
+	assert(result == buf);
+	fclose(f);
+	if (*buf == '\n')
+		ksft_exit_skip("No isolated cores configured.\n");
+	if (task_isolation_cpu != strtol(buf, &end, 10))
+		ksft_exit_skip("Isolated and nohz_full cores don't match.\n");
+	assert(end != buf);
+	assert(*end == ',' || *end == '-' || *end == '\n');
+
+	printf("/sys devices: OK (using task isolation cpu %d)\n",
+	       task_isolation_cpu);
+
+	/* Test that the prctl fails if no affinity mask is set. */
+	if (prctl_isolation(PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl unaffinitized: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl unaffinitized: OK\n");
+	}
+
+	/* Or if affinitized to the wrong cpu. */
+	set_my_cpu(0);
+	if (prctl_isolation(PR_TASK_ISOLATION_ENABLE) == 0 ||
+	    errno != EINVAL) {
+		printf("prctl on cpu 0: FAIL\n");
+		exit_status = KSFT_FAIL;
+	} else {
+		printf("prctl on cpu 0: OK\n");
+	}
+
+	/* Allocate some memory to be used for parent/child communication. */
+	shared = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+		      MAP_ANONYMOUS|MAP_SHARED, 0, 0);
+	assert(shared != MAP_FAILED);
+
+	/* Run the positive tests. */
+#ifndef DEBUG_TASK_ISOLATION
+	test_ok("test_off", NULL, do_syscall_off);
+	test_ok("test_segv", setup_segv, do_segv);
+#endif
+
+	/* Run the negative tests. */
+	test_killed("test_fault", setup_fault, do_fault);
+	test_killed("test_syscall", NULL, do_syscall);
+#ifdef TEST_MUNMAP
+	test_killed("test_munmap", setup_munmap, do_munmap);
+#endif
+#ifdef TEST_UNALIGNED
+	test_killed("test_unaligned", NULL, do_unaligned);
+#endif
+	test_killed("test_schedule", setup_schedule, do_schedule);
+
+	/* Exit failure if any test failed. */
+	if (exit_status != KSFT_PASS) {
+		printf("Skipping jitter testing due to test failures\n");
+		return exit_status;
+	}
+
+	test_jitter(waitticks);
+
+	return exit_status;
+}
-- 
2.1.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state
  2017-11-03 17:04 ` [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
@ 2017-11-03 17:18   ` Mark Rutland
  2017-11-03 17:36     ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Rutland @ 2017-11-03 17:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Marc Zyngier, linux-arm-kernel, linux-kernel

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:51PM -0400, Chris Metcalf wrote:
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index fd4b7f684bd0..61ea7f907c56 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
>  		}
>  	}
>  
> +	clk->set_state_oneshot_stopped = clk->set_state_shutdown;

AFAICT, we've set up this callback since commit:

  cf8c5009ee37d25c ("clockevents/drivers/arm_arch_timer: Implement ->set_state_oneshot_stopped()")

... so I don't believe this is necessary, and I think this change can be
dropped.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
  2017-11-03 17:04 ` [PATCH v16 10/13] arch/arm: " Chris Metcalf
@ 2017-11-03 17:23   ` Russell King - ARM Linux
  2017-11-03 17:27     ` Chris Metcalf
  2017-11-06 22:53     ` Chris Metcalf
  2018-03-18 14:48   ` Yury Norov
  1 sibling, 2 replies; 29+ messages in thread
From: Russell King - ARM Linux @ 2017-11-03 17:23 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel, Francis Giraldeau

On Fri, Nov 03, 2017 at 01:04:49PM -0400, Chris Metcalf wrote:
> From: Francis Giraldeau <francis.giraldeau@gmail.com>
> 
> This patch is a port of the task isolation functionality to the arm 32-bit
> architecture. The task isolation needs an additional thread flag that
> requires to change the entry assembly code to accept a bitfield larger than
> one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
> defined in the literal pool. The rest of the patch is straightforward and
> reflects what is done on other architectures.
> 
> To avoid problems with the tst instruction in the v7m build, we renumber
> TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

After a bit of digging (which could've been saved if our patch format
contained information about what kernel version this patch was
generated against) it turns out that this patch will not apply since
commit 73ac5d6a2b6ac ("arm/syscalls: Check address limit on user-mode
return") has been applied, which means the TIF numbers have changed
as well as the assembly code that your patch touches.

My guess is that this patch was generated from a 4.13 kernel, so
misses the 4.14-rc1 changes.  Since we're potentially about to start
the merge window for 4.15 this weekend, the timing of this doesn't
work well either.

Once 4.15-rc1 has been published, please rebase against that version
and resend.

Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
  2017-11-03 17:23   ` Russell King - ARM Linux
@ 2017-11-03 17:27     ` Chris Metcalf
  2017-11-06 22:53     ` Chris Metcalf
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:27 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel, Francis Giraldeau

On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:
> On Fri, Nov 03, 2017 at 01:04:49PM -0400, Chris Metcalf wrote:
>> From: Francis Giraldeau <francis.giraldeau@gmail.com>
>>
>> This patch is a port of the task isolation functionality to the arm 32-bit
>> architecture. The task isolation needs an additional thread flag that
>> requires to change the entry assembly code to accept a bitfield larger than
>> one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now
>> defined in the literal pool. The rest of the patch is straightforward and
>> reflects what is done on other architectures.
>>
>> To avoid problems with the tst instruction in the v7m build, we renumber
>> TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.
> After a bit of digging (which could've been saved if our patch format
> contained information about what kernel version this patch was
> generated against) it turns out that this patch will not apply since
> commit 73ac5d6a2b6ac ("arm/syscalls: Check address limit on user-mode
> return") has been applied, which means the TIF numbers have changed
> as well as the assembly code that your patch touches.
>
> My guess is that this patch was generated from a 4.13 kernel, so
> misses the 4.14-rc1 changes.  Since we're potentially about to start
> the merge window for 4.15 this weekend, the timing of this doesn't
> work well either.

What patch failure did you see?  The patch is based against 4.14-rc4, so
while it's a few weeks out of date, it does include the commit you reference.

> Once 4.15-rc1 has been published, please rebase against that version
> and resend.

Sure.  I was hoping to eke out a little bit of attention from kernel
developers before the merge window actually opens :)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 09/13] arch/arm64: enable task isolation functionality
  2017-11-03 17:04 ` [PATCH v16 09/13] arch/arm64: " Chris Metcalf
@ 2017-11-03 17:32   ` Mark Rutland
  2017-11-03 17:53     ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Rutland @ 2017-11-03 17:32 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-arm-kernel,
	linux-kernel

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:48PM -0400, Chris Metcalf wrote:
> In do_notify_resume(), call task_isolation_start() for
> TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
> and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
> since we don't clear _TIF_TASK_ISOLATION in the loop.
> 
> We tweak syscall_trace_enter() slightly to carry the "flags"
> value from current_thread_info()->flags for each of the tests,
> rather than doing a volatile read from memory for each one.  This
> avoids a small overhead for each test, and in particular avoids
> that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.
> 
> We instrument the smp_send_reschedule() routine so that it checks for
> isolated tasks and generates a suitable warning if needed.
> 
> Finally, report on page faults in task-isolation processes in
> do_page_faults().

I don't have much context for this (I only received patches 9, 10, and
12), and this commit message doesn't help me to understand why these
changes are necessary.

Some elaboration on what the intended semantics are would be helpful.

[...]

>  #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
>  				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
> -				 _TIF_UPROBE | _TIF_FSCHECK)
> +				 _TIF_UPROBE | _TIF_FSCHECK | \
> +				 _TIF_TASK_ISOLATION)

Here we add to _TIF_WORK_MASK...

> @@ -741,6 +742,10 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> +#define NOTIFY_RESUME_LOOP_FLAGS				     \
> +	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME | \
> +	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)
> +

... and here we open-code the *old* _TIF_WORK_MASK.

Can we drop both in <asm/thread_info.h>, building one in terms of the
other:

#define _TIF_WORK_NOISOLATION_MASK					\
	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME |	\
	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)

#define _TIF_WORK_MASK							\
	(_TIF_WORK_NOISOLATION_MASK | _TIF_TASK_ISOLATION)

... that avoids duplication, ensuring the two are kept in sync, and
makes it a little easier to understand.

>  asmlinkage void do_notify_resume(struct pt_regs *regs,
>  				 unsigned int thread_flags)
>  {
> @@ -777,5 +782,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>  
>  		local_irq_disable();
>  		thread_flags = READ_ONCE(current_thread_info()->flags);
> -	} while (thread_flags & _TIF_WORK_MASK);
> +	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
> +
> +	if (thread_flags & _TIF_TASK_ISOLATION)
> +		task_isolation_start();

[...]

> @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
>  #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>  void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "wakeup IPI");

What exactly does this do? Is it some kind of a tracepoint?

As above, I only received patches 9, 10, and 12, so I don't have much
context to go on.

> @@ -879,6 +881,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  		__inc_irq_stat(cpu, ipi_irqs[ipinr]);
>  	}
>  
> +	task_isolation_interrupt("IPI type %d (%s)", ipinr,
> +				 ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
> +
>  	switch (ipinr) {
>  	case IPI_RESCHEDULE:
>  		scheduler_ipi();
> @@ -941,12 +946,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
>  
>  void smp_send_reschedule(int cpu)
>  {
> +	task_isolation_remote(cpu, "reschedule IPI");
>  	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
>  }
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>  void tick_broadcast(const struct cpumask *mask)
>  {
> +	task_isolation_remote_cpumask(mask, "timer IPI");
>  	smp_cross_call(mask, IPI_TIMER);
>  }
>  #endif

Similarly, I don't know what these are doing.

[...]

> @@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  	 */
>  	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
>  			      VM_FAULT_BADACCESS)))) {
> +		/* No signal was generated, but notify task-isolation tasks. */
> +		if (user_mode(regs))
> +			task_isolation_interrupt("page fault at %#lx", addr);

What exactly does the task receive here? Are these strings ABI?

Do we need to do this for *every* exception?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state
  2017-11-03 17:18   ` Mark Rutland
@ 2017-11-03 17:36     ` Chris Metcalf
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Marc Zyngier, linux-arm-kernel, linux-kernel

On 11/3/2017 1:18 PM, Mark Rutland wrote:
> Hi Chris,
>
> On Fri, Nov 03, 2017 at 01:04:51PM -0400, Chris Metcalf wrote:
>> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
>> index fd4b7f684bd0..61ea7f907c56 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
>>   		}
>>   	}
>>   
>> +	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
> AFAICT, we've set up this callback since commit:
>
>    cf8c5009ee37d25c ("clockevents/drivers/arm_arch_timer: Implement ->set_state_oneshot_stopped()")
>
> ... so I don't believe this is necessary, and I think this change can be
> dropped.

Thanks, I will drop it.  I missed the semantic merge conflict there.

I extracted the arch/tile specific part of the change and just pushed it
through the tile tree.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 09/13] arch/arm64: enable task isolation functionality
  2017-11-03 17:32   ` Mark Rutland
@ 2017-11-03 17:53     ` Chris Metcalf
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-arm-kernel,
	linux-kernel

On 11/3/2017 1:32 PM, Mark Rutland wrote:
> Hi Chris,
>
> On Fri, Nov 03, 2017 at 01:04:48PM -0400, Chris Metcalf wrote:
>> In do_notify_resume(), call task_isolation_start() for
>> TIF_TASK_ISOLATION tasks.  Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
>> and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
>> since we don't clear _TIF_TASK_ISOLATION in the loop.
>>
>> We tweak syscall_trace_enter() slightly to carry the "flags"
>> value from current_thread_info()->flags for each of the tests,
>> rather than doing a volatile read from memory for each one.  This
>> avoids a small overhead for each test, and in particular avoids
>> that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.
>>
>> We instrument the smp_send_reschedule() routine so that it checks for
>> isolated tasks and generates a suitable warning if needed.
>>
>> Finally, report on page faults in task-isolation processes in
>> do_page_faults().
> I don't have much context for this (I only received patches 9, 10, and
> 12), and this commit message doesn't help me to understand why these
> changes are necessary.

Sorry, I missed having you on the cover letter.  I'll fix that for the 
next spin.
The cover letter (and rest of the series) is here:

https://lkml.org/lkml/2017/11/3/589

The core piece of the patch is here:

https://lkml.org/lkml/2017/11/3/598

> Here we add to _TIF_WORK_MASK...
> [...]
> ... and here we open-code the *old* _TIF_WORK_MASK.
>
> Can we drop both in <asm/thread_info.h>, building one in terms of the
> other:
>
> #define _TIF_WORK_NOISOLATION_MASK					\
> 	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME |	\
> 	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)
>
> #define _TIF_WORK_MASK							\
> 	(_TIF_WORK_NOISOLATION_MASK | _TIF_TASK_ISOLATION)
>
> ... that avoids duplication, ensuring the two are kept in sync, and
> makes it a little easier to understand.

We certainly could do that.  I based my approach on the x86 model,
which defines _TIF_ALLWORK_MASK in thread_info.h, and then a local
EXIT_TO_USERMODE_WORK_FLAGS above exit_to_usermode_loop().

If you'd prefer to avoid the duplication, perhaps names more like this?

_TIF_WORK_LOOP_MASK (without TIF_TASK_ISOLATION)
_TIF_WORK_MASK as _TIF_WORK_LOOP_MASK | _TIF_TASK_ISOLATION

That keeps the names reflective of the function (entry only vs loop).
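
Concretely, something like this (sketch only; the exact flag list would
track whatever is in <asm/thread_info.h> when this lands):

#define _TIF_WORK_LOOP_MASK					\
	(_TIF_NEED_RESCHED | _TIF_SIGPENDING |  _TIF_NOTIFY_RESUME |	\
	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)

#define _TIF_WORK_MASK						\
	(_TIF_WORK_LOOP_MASK | _TIF_TASK_ISOLATION)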

>> @@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
>>   #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
>>   void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
>>   {
>> +	task_isolation_remote_cpumask(mask, "wakeup IPI");
> What exactly does this do? Is it some kind of a tracepoint?

It is intended to generate a diagnostic for a remote task that is
trying to run isolated from the kernel (NOHZ_FULL on steroids, more
or less), if the kernel is about to interrupt it.

Similarly, the task_isolation_interrupt() hooks are diagnostics for
the current task.  The intent is that by hooking a little deeper in
the call path, you get actionable diagnostics for processes that are
about to be signalled because they have lost task isolation for some
reason.
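
Schematically, the remote hook amounts to something like this (a simplified
sketch, not the exact code in kernel/isolation.c; cpu_curr() here stands in
for however the real code looks up the remote task):

static void task_isolation_remote(int cpu, const char *why)
{
	struct task_struct *p = cpu_curr(cpu);	/* sketch: scheduler-internal */

	if (p->task_isolation_flag & PR_TASK_ISOLATION_ENABLE)
		pr_warn("%s/%d: cpu %d: task_isolation lost due to %s\n",
			p->comm, p->pid, cpu, why);
}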

>> @@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>>   	 */
>>   	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
>>   			      VM_FAULT_BADACCESS)))) {
>> +		/* No signal was generated, but notify task-isolation tasks. */
>> +		if (user_mode(regs))
>> +			task_isolation_interrupt("page fault at %#lx", addr);
> What exactly does the task receive here? Are these strings ABI?
>
> Do we need to do this for *every* exception?

The strings are diagnostic messages; the process itself just gets
a SIGKILL (or user-defined signal if requested).  To provide better
diagnosis we emit a log message that can be examined to see
what exactly caused the signal to be generated.
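
On the userspace side, opting in to a catchable signal is just this (a
sketch mirroring what the selftest does; the handler name is made up):

	signal(SIGUSR1, lost_isolation_handler);
	prctl(PR_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
	      0, 0, 0);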

Thanks!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (12 preceding siblings ...)
  2017-11-03 17:04 ` [PATCH v16 13/13] task_isolation self test Chris Metcalf
@ 2017-11-06 15:38 ` Christopher Lameter
       [not found]   ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
  2018-03-07 10:07 ` Yury Norov
  2018-07-12 12:29 ` Yury Norov
  15 siblings, 1 reply; 29+ messages in thread
From: Christopher Lameter @ 2017-11-06 15:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel

On Fri, 3 Nov 2017, Chris Metcalf wrote:

> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I

Calling quiet_vmstat() gives you the folding of the statistics without
the numa page allocator page cleanup. Leaving some pages in the queues
should not be that much of an issue.
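
That is, the distinction is roughly:

	quiet_vmstat();			/* fold per-cpu vm counters only */
	refresh_cpu_vm_stats(true);	/* fold counters and drain pagesets */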

> What about that d*mn 1 Hz clock?
>
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack).  Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true?  Christoph was working on this at one
> point - any progress from your point of view?

Well if you still have the 1 HZ clock then you can simply defer the numa
remote page cleanup of the page allocator to the time you execute
that tick.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
  2017-11-03 17:23   ` Russell King - ARM Linux
  2017-11-03 17:27     ` Chris Metcalf
@ 2017-11-06 22:53     ` Chris Metcalf
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Metcalf @ 2017-11-06 22:53 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	linux-arm-kernel, linux-kernel, Francis Giraldeau

On 11/3/2017 1:23 PM, Russell King - ARM Linux wrote:
> Since we're potentially about to start
> the merge window for 4.15 this weekend, the timing of this doesn't
> work well either.

With the start of the merge window now delayed for a week, I'm sure
everyone can distract themselves and help make the last week of -rc8
pass more quickly by digging into this patch series!  :-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
       [not found]   ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
@ 2017-11-07 17:10     ` Christopher Lameter
  2017-11-07 17:27       ` Chris Metcalf
  0 siblings, 1 reply; 29+ messages in thread
From: Christopher Lameter @ 2017-11-07 17:10 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel

On Mon, 6 Nov 2017, Chris Metcalf wrote:

> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
> > > What about that d*mn 1 Hz clock?
> > >
> > > It's still there, so this code still requires some further work before
> > > it can actually get a process into long-term task isolation (without
> > > the obvious one-line kernel hack).  Frederic suggested a while ago
> > > forcing updates on cpustats was required as the last gating factor; do
> > > we think that is still true?  Christoph was working on this at one
> > > point - any progress from your point of view?
> > Well if you still have the 1 HZ clock then you can simply defer the numa
> > remote page cleanup of the page allocator to the time you execute
> > that tick.
>
> We have to get rid of the 1 Hz tick, so we don't want to tie anything
> else to it...

Yes we want to get rid of the 1 HZ tick but the work on that could also
include dealing with the remote page cleanup issue that we have deferred.

Presumably we have another context where we may be able to call into
the cleanup code with interrupts enabled.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-07 17:10     ` Christopher Lameter
@ 2017-11-07 17:27       ` Chris Metcalf
  2017-11-07 17:54         ` Christopher Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Metcalf @ 2017-11-07 17:27 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel

On 11/7/2017 12:10 PM, Christopher Lameter wrote:
> On Mon, 6 Nov 2017, Chris Metcalf wrote:
>
>> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
>>>> What about that d*mn 1 Hz clock?
>>>>
>>>> It's still there, so this code still requires some further work before
>>>> it can actually get a process into long-term task isolation (without
>>>> the obvious one-line kernel hack).  Frederic suggested a while ago
>>>> forcing updates on cpustats was required as the last gating factor; do
>>>> we think that is still true?  Christoph was working on this at one
>>>> point - any progress from your point of view?
>>> Well if you still have the 1 HZ clock then you can simply defer the numa
>>> remote page cleanup of the page allocator to the time you execute
>>> that tick.
>> We have to get rid of the 1 Hz tick, so we don't want to tie anything
>> else to it...
> Yes we want to get rid of the 1 HZ tick but the work on that could also
> include dealing with the remote page cleanup issue that we have deferred.
>
> Presumably we have another context where we may be able to call into
> the cleanup code with interrupts enabled.

Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
there instead (and probably should) since we can handle dealing with
the pagesets at this time.  As we return to userspace we will test that
nothing surprising happened with vmstat; if something did, we jam an EAGAIN
into the syscall result value, but if not, we will be in userspace and won't
need to touch the vmstat counters until we next go back into the kernel.
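
In other words, the shape of it is roughly this (sketch; vmstat_idle() is a
stand-in for whatever final form the "is vmstat quiesced" test takes):

	/* In sys_prctl(), with interrupts still enabled: */
	quiet_vmstat_sync();

	/* In task_isolation_start(), on the way back to userspace: */
	if (!vmstat_idle())
		syscall_set_return_value(current, current_pt_regs(),
					 -EAGAIN, -1);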

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-07 17:27       ` Chris Metcalf
@ 2017-11-07 17:54         ` Christopher Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christopher Lameter @ 2017-11-07 17:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel


On Tue, 7 Nov 2017, Chris Metcalf wrote:

> > Presumably we have another context there were we may be able to call into
> > the cleanup code with interrupts enabled.
>
> Right now for task isolation we run with interrupts enabled during the
> initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
> calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
> there instead (and probably should) since we can handle dealing with
> the pagesets at this time.  As we return to userspace we will test that
> nothing surprising happened with vmstat; if something did, we jam an EAGAIN
> the syscall result value, but if not, we will be in userspace and won't need
> to touch the vmstat counters until we next go back into the kernel.

If you do it too early and there is another page allocator action then it
may potentially repopulate the caches. Since this is only draining the
caches for remote node allocation it may be rare and not that important.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (13 preceding siblings ...)
  2017-11-06 15:38 ` [PATCH v16 00/13] support "task_isolation" mode Christopher Lameter
@ 2018-03-07 10:07 ` Yury Norov
  2018-07-12 12:29 ` Yury Norov
  15 siblings, 0 replies; 29+ messages in thread
From: Yury Norov @ 2018-03-07 10:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-mm, linux-doc, linux-api, linux-kernel,
	Prasun Kapoor, Narayana Prasad Athreya, Alex Belits,
	Chandrakala Chavva

Hi Chris,

(CC Cavium people)

Thanks for your series.

On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.
> 
> This version of the patch series cleans up a number of areas that were
> a little dodgy in the previous patch series.
> 
> - It no longer loops in the final code that prepares to return to
>   userspace; instead, it sets things up in the prctl() and then
>   validates when preparing to return to userspace, adjusting the
>   syscall return value to -EAGAIN at that point if something doesn't
>   line up quite correctly.
> 
> - We no longer support the NOSIG mode that let you freely call into
>   the kernel multiple times while in task isolation.  This was always
>   a little odd, since you really should be in sufficient control of
>   task isolation code that you can explicitly stop isolation with a
>   "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
>   else.  It also made the semantics of migrating the task confusing.
>   More importantly, removing that support means that the only path
>   that sets up task isolation is the return from prctl(), which allows
>   us to make the simplification above.
> 
> - We no longer try to signal the task isolation process from a remote
>   core when we detect that we are about to violate its isolation.
>   Instead, we just print a message there (and optionally dump stack),
>   and rely on the eventual interrupt on the core itself to trigger the
>   signal.  We are always in a safe context to generate a signal when
>   we enter the kernel, unlike when we are deep in a call stack sending
>   an SMP IPI or whatever.
> 
> - We notice the case of an unstable scheduler clock and return
>   EINVAL rather than spinning forever with EAGAIN (suggestion from
>   Francis Giraldeau).
> 
> - The prctl() call requires zeros for arg3/4/5 (suggestion from
>   Eugene Syromiatnikov).
> 
> The kernel internal isolation API is also now cleaner, and I have
> included kerneldoc APIs for all the interfaces so that it should be
> easier to port it to additional architectures; in fact looking at
> include/linux/isolation.h is a good place to start understanding the
> overall patch set.
> 
> I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
> for x86, since this version is sufficiently different to merit
> re-review and re-testing.
> 
> Note that this is not rebased on top of Frederic's recent housekeeping
> patch series, although it is largely orthogonal right now.  After
> Frederic's patch series lands, task isolation is enabled with
> "isolcpus=nohz,domain,CPUS".  We could add some shorthand for that
> ("isolcpus=full,CPUS"?) or just use it as-is.
> 
> The previous (v15) patch series is here:
> 
> https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@mellanox.com
> 
> This patch series is available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
> 
> Some folks raised some good points at the LPC discussion and then in
> email discussions that followed.  Rather than trying to respond to
> everyone in a flurry of emails, I'll answer some questions here:
> 
> 
> Why not just instrument user_exit() to raise the isolation-lost signal?
> 
> Andy pointed me in this direction.  The advantage is that you catch
> *everything*, by definition.  There is a hook that can do this in the
> current patch set, but you have to #define DEBUG_TASK_ISOLATION
> manually to take advantage of it, because as written it has two issues:
> 
> 1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
>    because the user_exit hook kills you first.
> 2. You lose the ability to get much better diagnostics by waiting
>    until you are further into kernel entry and know what you're doing.
> 
> You could work around #2 in several ways, but #1 is harder.  I looked
> at x86 for a while, and although you could imagine this, you really
> want to generate a lost-isolation signal on any syscall that isn't
> that exact prctl(), and it's awkward to try to do all of that checking
> before user_exit().  Since in any case we do want to have the more
> specific probes at the various kernel entry points where we generate
> the diagnostics, I felt like it wasn't the right approach to enable
> as a compilation-time default.  I'm open to discussion on this though!
> 
> 
> Can't we do all the exit-to-userspace work with irqs disabled?
> 
> In fact, it turns out that you can do lru_add_drain() with irqs
> disabled, so that's what we're doing in the patch series now.
> 
> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I
> just issue the cancellation during sys_prctl() call, and then if it
> isn't synchronized by the time we are exiting to userspace, we just
> jam in an EAGAIN and let userspace retry.  In practice, this doesn't
> seem to ever happen.
> 
> 
> What about using a per-cpu flag to stop doing new deferred work?
> 
> Andy also suggested we could structure the code to have the prctl()
> set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
> data, or lru page cache).  Then, we could flush those structures right
> from the sys_prctl() call, and when we were returning to user space,
> we'd be confident that there wasn't going to be any new work added.
> 
> With the current set of things that we are disabling for task
> isolation, though, it didn't seem necessary.  Quiescing the vmstat
> shepherd seems like it is generally pretty safe since we will likely
> be able to sync up the per-cpu cache and kill the deferred work with
> high probability, with no expectation that additional work will show
> up.  And since we can flush the LRU page cache with interrupts
> disabled, that turns out not to be an issue either.
> 
> I could imagine that if we have to deal with some new kind of deferred
> work, we might find the per-cpu flag becomes a good solution, but for
> now we don't have a good use case for that approach.
> 
> 
> How about stopping the dyn tick?
> 
> Right now we try to stop it on return to userspace, but if we can't,
> we just return EAGAIN to userspace.  In practice, what I see is that
> usually the tick stops immediately, but occasionally it doesn't; in
> this case I've always seen that nr_running is >1, presumably with some
> temporary kernel worker threads, and the user code just needs to call
> prctl() until those threads are done.  We could structure things with
> a completion that we wait for, which is set by the timer code when it
> finally does stop the tick, but this may be overkill, particularly
> since we'll only be running this prctl() loop from userspace on cores
> where we have no other useful work that we're trying to run anyway.
> 
> 
> What about TLB flushing?
> 
> We talked about this at Plumbers and some of the email discussion also
> was about TLB flushing.  I haven't tried to add it to this patch set,
> because I really want to avoid scope creep; in any case, I think I
> managed to convince Andy that he was going to work on it himself. :)
> Paul McKenney already contributed some framework for such a patch, in
> commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
> ->dynticks counter").
> 
> What about that d*mn 1 Hz clock?
> 
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack).  Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true?  Christoph was working on this at one
> point - any progress from your point of view?

I tested your series on a ThunderX2 machine. When I ran the 10 giga-tick
test, it always passed. If I ran it for longer, the test exited like this:

# time ./isolation 1000
/sys devices: OK (using task isolation cpu 100)
prctl unaffinitized: OK
prctl on cpu 0: OK
==> hello, world
test_off: OK
Received signal 11 successfully
test_segv: OK
test_fault: OK
test_fault (SIGUSR1): OK
test_syscall: OK
test_syscall (SIGUSR1): OK
test_schedule: OK
test_schedule (SIGUSR1): OK
testing task isolation jitter for 1000000000000 ticks
ERROR: Program unexpectedly entered kernel.
INFO: loop times:
  1 cycles (count: 128606844716)
  2 cycles (count: 31558352292)
  3 cycles (count: 4)
  29 cycles (count: 437)
  30 cycles (count: 1874)
  31 cycles (count: 221)
  57 cycles (count: 4)
  58 cycles (count: 6)
  59 cycles (count: 1)

real  15m58.643s
user  15m58.626s
sys   0m0.012s

I passed task_isolation_debug in the boot parameters:
# cat /proc/cmdline 
BOOT_IMAGE=/boot/Image-isol nohz_full=100-110 isolcpus=100-110 task_isolation_debug root=UUID=75b9dd5b-58d8-4a50-8868-004cb7c1f25f ro text

But dmesg looks empty...

I'm investigating it, but so far I have no idea what is happening. Have you
seen this before?

Anyway, we are going to include your test in our scenario, with some
modifications. I've added a --dry-run option to make it easier to
demonstrate the effect of isolation on jitter. If you like it, feel
free to use this change.
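
With the change applied, the two modes can be compared directly:

  # ./isolation 10            (isolated run, 10 gigaticks)
  # ./isolation --dry-run 10  (same loop, no prctl(PR_TASK_ISOLATION))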

Tested-by: Yury Norov <ynorov@caviumnetworks.com>

From 5c5823c1fc8441ab1486287679de77b8cea5154d Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Wed, 7 Mar 2018 02:41:08 +0300
Subject: [PATCH] isolation test: --dry-run mode

Add dry-run mode for better understanding the effect of isolation.
Also, make sanity checks on waitticks.

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 tools/testing/selftests/task_isolation/isolation.c | 47 +++++++++++++++++-----
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
index 9c0b49619b40..e90a6c85fe2a 100644
--- a/tools/testing/selftests/task_isolation/isolation.c
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -53,6 +53,7 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <fcntl.h>
+#include <limits.h>
 #include <assert.h>
 #include <string.h>
 #include <errno.h>
@@ -500,7 +501,7 @@ static void jitter_handler(int sig)
 	exit(exit_status);
 }
 
-static void test_jitter(unsigned long waitticks)
+static void test_jitter(unsigned long waitticks, int flags)
 {
 	u_int64_t start, last, elapsed;
 	int rc;
@@ -513,7 +514,8 @@ static void test_jitter(unsigned long waitticks)
 	rc = mlockall(MCL_CURRENT);
 	assert(rc == 0);
 
-	set_task_isolation(PR_TASK_ISOLATION_ENABLE |
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_task_isolation(PR_TASK_ISOLATION_ENABLE |
 			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
 
 	last = start = get_cycle_count();
@@ -537,26 +539,49 @@ static void test_jitter(unsigned long waitticks)
 	} while (elapsed < waitticks);
 
 	jitter_test_complete = true;
-	prctl_isolation(0);
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		prctl_isolation(0);
+
 	jitter_summarize();
 }
 
 int main(int argc, char **argv)
 {
 	/* How many billion ticks to wait after running the other tests? */
-	unsigned long waitticks;
+	long waitticks = 10;
+	int flags = PR_TASK_ISOLATION_ENABLE;
 	char buf[100];
 	char *result, *end;
 	FILE *f;
 
-	if (argc == 1)
-		waitticks = 10;
-	else if (argc == 2)
-		waitticks = strtol(argv[1], NULL, 10);
-	else {
-		printf("syntax: isolation [gigaticks]\n");
+	errno = 0;
+
+	switch (argc) {
+	case 1:
+		break;
+	case 2:
+		if (strcmp(argv[1], "--dry-run") == 0)
+			flags = 0;
+		else
+			waitticks = strtol(argv[1], NULL, 10);
+		break;
+	case 3:
+		if (strcmp(argv[1], "--dry-run") == 0)
+			flags = 0;
+
+		waitticks = strtol(argv[2], NULL, 10);
+		break;
+	default:
+		printf("syntax: isolation [--dry-run] [gigaticks]\n");
 		ksft_exit_fail();
 	}
+
+	if (errno || waitticks <= 0 || waitticks > (LONG_MAX / 1000000000)) {
+		printf("Gigaticks: bad number.\n");
+		ksft_exit_fail();
+	}
+
 	waitticks *= 1000000000;
 
 	/* Get a core from the /sys nohz_full device. */
@@ -637,7 +662,7 @@ int main(int argc, char **argv)
 		return exit_status;
 	}
 
-	test_jitter(waitticks);
+	test_jitter(waitticks, flags);
 
 	return exit_status;
 }
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
  2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
@ 2018-03-18 14:22   ` Yury Norov
  0 siblings, 0 replies; 29+ messages in thread
From: Yury Norov @ 2018-03-18 14:22 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	Jonathan Corbet, linux-mm, linux-doc, linux-api, linux-kernel

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:45PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
> nohz_full and isolcpus.  The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags.  When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced.  If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired.  In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
> 
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
> 
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting.  Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
> 
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
> 
> Separate patches that follow provide these changes for x86, tile,
> arm, and arm64.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   6 +
>  include/linux/isolation.h                       | 175 +++++++++++
>  include/linux/sched.h                           |   4 +
>  include/uapi/linux/prctl.h                      |   6 +
>  init/Kconfig                                    |  28 ++
>  kernel/Makefile                                 |   1 +
>  kernel/context_tracking.c                       |   2 +
>  kernel/isolation.c                              | 402 ++++++++++++++++++++++++
>  kernel/signal.c                                 |   2 +
>  kernel/sys.c                                    |   6 +
>  10 files changed, 631 insertions(+)
>  create mode 100644 include/linux/isolation.h
>  create mode 100644 kernel/isolation.c
 
[...]

> + * This routine is called from syscall entry, prevents most syscalls
> + * from executing, and if needed raises a signal to notify the process.
> + *
> + * Note that we have to stop isolation before we even print a message
> + * here, since otherwise we might end up reporting an interrupt due to
> + * kicking the printk handling code, rather than reporting the true
> + * cause of interrupt here.
> + */
> +int task_isolation_syscall(int syscall)
> +{

All callers of this function call it like this:
        if (work & _TIF_TASK_ISOLATION) {
                if (task_isolation_syscall(regs->syscallno) ==
                                -1)
                        return -1;
        }

Would it make sense to move the check of the _TIF_TASK_ISOLATION flag inside
the function itself?  (A sketch follows the quoted body below.)

> +	struct task_struct *task = current;
> +
> +	if (is_acceptable_syscall(syscall)) {
> +		stop_isolation(task);
> +		return 0;
> +	}
> +
> +	send_isolation_signal(task);
> +
> +	pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
> +		task->comm, task->pid, smp_processor_id(), syscall);
> +	debug_dump_stack();
> +
> +	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
> +	return -1;
> +}
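
Something like this (untested sketch), so that every caller reduces to a
plain "if (task_isolation_syscall(scno) == -1) return -1;":

int task_isolation_syscall(int syscall)
{
	struct task_struct *task = current;

	if (!test_thread_flag(TIF_TASK_ISOLATION))
		return 0;

	if (is_acceptable_syscall(syscall)) {
		stop_isolation(task);
		return 0;
	}
	...
}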

Yury

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 10/13] arch/arm: enable task isolation functionality
  2017-11-03 17:04 ` [PATCH v16 10/13] arch/arm: " Chris Metcalf
  2017-11-03 17:23   ` Russell King - ARM Linux
@ 2018-03-18 14:48   ` Yury Norov
  1 sibling, 0 replies; 29+ messages in thread
From: Yury Norov @ 2018-03-18 14:48 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Mark Rutland,
	Russell King, linux-arm-kernel, linux-kernel, Francis Giraldeau

Hi Francis, Chris,

On Fri, Nov 03, 2017 at 01:04:49PM -0400, Chris Metcalf wrote:
> From: Francis Giraldeau <francis.giraldeau@gmail.com>
> 
> This patch is a port of the task isolation functionality to the arm 32-bit
> architecture.  Task isolation needs an additional thread flag, which
> requires changing the entry assembly code to accept a bitfield larger
> than one byte.  The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are
> now defined in the literal pool.  The rest of the patch is
> straightforward and reflects what is done on other architectures.
> 
> To avoid problems with the tst instruction in the v7m build, we renumber
> TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.
> 
> Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [with modifications]

[...]

> ---
> diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
> index 58e3771e4c5b..0cfcba5a93df 100644
> --- a/arch/arm/kernel/ptrace.c
> +++ b/arch/arm/kernel/ptrace.c
> @@ -27,6 +27,7 @@
>  #include <linux/audit.h>
>  #include <linux/tracehook.h>
>  #include <linux/unistd.h>
> +#include <linux/isolation.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/traps.h>
> @@ -936,6 +937,15 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
>  	if (test_thread_flag(TIF_SYSCALL_TRACE))
>  		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
>  
> +	/*
> +	 * In task isolation mode, we may prevent the syscall from
> +	 * running, and if so we also deliver a signal to the process.
> +	 */
> +	if (test_thread_flag(TIF_TASK_ISOLATION)) {
> +		if (task_isolation_syscall(scno) == -1)
> +			return -1;
> +	}

I think it would make sense to load the thread flags into a local
variable, because later in the code test_thread_flag() is called again
to check the TIF_SYSCALL_TRACEPOINT flag, and the second read can be
avoided, like this:

unsigned long work = READ_ONCE(current_thread_info()->flags);

Also, all other architectures cache the thread flags in a local
variable before using them, so doing the same here would help unify
the code across architectures.
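
Concretely, the entry hook could then look like this (untested; the
TIF_SYSCALL_TRACEPOINT check paraphrases the existing arm code):

        unsigned long work = READ_ONCE(current_thread_info()->flags);

        if (work & _TIF_TASK_ISOLATION) {
                if (task_isolation_syscall(scno) == -1)
                        return -1;
        }

        /* ... */

        /* Reuses the cached flags instead of a second test_thread_flag(). */
        if (work & _TIF_SYSCALL_TRACEPOINT)
                trace_sys_enter(regs, scno);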

Yury

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (14 preceding siblings ...)
  2018-03-07 10:07 ` Yury Norov
@ 2018-07-12 12:29 ` Yury Norov
  15 siblings, 0 replies; 29+ messages in thread
From: Yury Norov @ 2018-07-12 12:29 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Sunil Goutham, linux-mm, linux-doc, linux-api,
	linux-kernel

On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.

Hi Chris,

There's another possible way to break task isolation, via the net
subsystem.  See the patch below.

Yury


From 8025e9330bf06ce146d4ba96833aad6eafe24759 Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Sun, 8 Jul 2018 00:40:46 +0300
Subject: [PATCH] net: don't let user assign task isolation CPUs for RPS

The Receive Packet Steering (RPS) subsystem distributes network
receive processing across the CPUs the user specifies in
/sys/class/net/<dev>/queues/rx-<n>/rps_cpus.

If rps_cpus intersects with task_isolation_map, RPS may break task
isolation by assigning RPS work to a CPU that runs an isolated task.

This patch filters the user-provided rps_cpus map to avoid that.

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 include/linux/isolation.h |  2 ++
 net/core/net-sysfs.c      | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index f467545ad37d..b7f0a9085b13 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -14,6 +14,8 @@ struct task_struct;
 
 #ifdef CONFIG_TASK_ISOLATION
 
+extern cpumask_var_t task_isolation_map;
+
 /**
  * task_isolation_request() - prctl hook to request task isolation
  * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..18e576893984 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 
 #include <linux/capability.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/netdevice.h>
 #include <net/switchdev.h>
 #include <linux/if_arp.h>
@@ -727,6 +728,18 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+#ifdef CONFIG_TASK_ISOLATION
+	if (cpumask_intersects(mask, task_isolation_map)) {
+		char tmp[256];
+
+		pr_warn("RPS is not allowed on CPUs allocated for isolated tasks\n");
+
+		cpumask_andnot(mask, mask, task_isolation_map);
+		cpumap_print_to_pagebuf(1, tmp, mask);
+		pr_warn("RPS CPUs list is reduced to: %s\n", tmp);
+	}
+#endif
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.17.1
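
For context, the rps_cpus knob that the patch filters is configured by
writing a hex CPU mask to sysfs.  A trivial illustration (the device
name and mask here are made up):

        #include <stdio.h>

        int main(void)
        {
                /* Steer eth0's rx-0 queue to CPUs 0-3 (mask 0xf). */
                FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

                if (!f)
                        return 1;
                fprintf(f, "f\n");
                return fclose(f) ? 1 : 0;
        }

With the patch applied, any of those CPUs that appear in
task_isolation_map would be stripped from the stored map, with the
reduced list reported via pr_warn().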


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-07-12 12:29 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 01/13] vmstat: add quiet_vmstat_sync function Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 02/13] vmstat: add vmstat_idle function Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 03/13] Revert "sched/core: Drop the unused try_get_task_struct() helper function" Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 04/13] Add try_get_task_struct_on_cpu() to scheduler for task isolation Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 05/13] Add try_stop_full_tick() API for NO_HZ_FULL Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
2018-03-18 14:22   ` Yury Norov
2017-11-03 17:04 ` [PATCH v16 07/13] Add task isolation hooks to arch-independent code Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 08/13] arch/x86: enable task isolation functionality Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 09/13] arch/arm64: " Chris Metcalf
2017-11-03 17:32   ` Mark Rutland
2017-11-03 17:53     ` Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 10/13] arch/arm: " Chris Metcalf
2017-11-03 17:23   ` Russell King - ARM Linux
2017-11-03 17:27     ` Chris Metcalf
2017-11-06 22:53     ` Chris Metcalf
2018-03-18 14:48   ` Yury Norov
2017-11-03 17:04 ` [PATCH v16 11/13] arch/tile: " Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2017-11-03 17:18   ` Mark Rutland
2017-11-03 17:36     ` Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 13/13] task_isolation self test Chris Metcalf
2017-11-06 15:38 ` [PATCH v16 00/13] support "task_isolation" mode Christopher Lameter
     [not found]   ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
2017-11-07 17:10     ` Christopher Lameter
2017-11-07 17:27       ` Chris Metcalf
2017-11-07 17:54         ` Christopher Lameter
2018-03-07 10:07 ` Yury Norov
2018-07-12 12:29 ` Yury Norov
