* [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

It has been a couple of months since the v8 version of this patch
series, as various other priorities came up at work.  Given the
delay, I will try to summarize where I think we got to on the
various issues that were raised with v8.

1. Andy Lutomirski raised the issue of whether it really made sense to
   only attempt to set up the conditions for task isolation, ask the kernel
   nicely for it, and then wait until it happened.  He wondered if a
   SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
   also suggested having an interface that would force everything else
   off a core to enable SCHED_ISOLATED to succeed.  Frederic added
   some concerns about enforcing the test that the process was in a
   good state to enter task isolation.

   I tried to address the different design philosophies for what I called
   the original "polite" mode and the reviewers' suggestions for an
   "aggressive" mode in this email:

   https://lkml.org/lkml/2015/10/26/625

   As I said there, on balance I think the "polite" option is still
   better.  Obviously folks are welcome to disagree and I'm happy to
   continue that conversation (or perhaps I convinced everyone).

2. Andy didn't like the idea of having a "STRICT" mode which
   delivered a signal to a process for violating the contract that it
   will promise to stay out of the kernel.  Gilad Ben Yossef argued that
   it made sense to have a way for the kernel to enforce the requested
   correctness guarantee of never being interrupted.  Andy pointed out
   that we should then really deliver such a signal when the kernel
   delivers an asynchronous interrupt to the core as well.  In particular
   this is a concern for the application-error case of a process that
   calls munmap() on one core while a thread on another core is running
   STRICT, and thus gets an unexpected TLB flush.

   This patch series addresses that concern by including support for
   IRQs, IPIs, and similar asynchronous interrupts to also send the
   STRICT signal to the process.  We don't try to send the signal if
   we are in an NMI, and instead just force a console backtrace like
   you would get in task_isolation_debug mode.

3. Frederic nacked my patch for a boot flag to disable the 1Hz
   periodic scheduler tick.

   I'm still hoping he's open to changing his mind about that, but in
   this patch series I have removed that boot flag.

Various other changes have been introduced since v8:

https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com

- Rebased to Linux 4.4-rc5.

- Since nohz_full and isolcpus have been separated back out again in
  4.4, I introduced a new task_isolation=MASK boot argument that sets
  both of them.  The task isolation support now requires that this
  boot flag have been used; it intentionally doesn't work if you've
  just enabled nohz_full and isolcpus separately.  I could be
  convinced that doing it the other way around makes sense, though.

- I folded the two STRICT mode patches together since there didn't
  seem to be much value in having the second patch that just enabled
  having a settable signal.  I also refactored the various routines
  that report on interrupts/exceptions/etc to make it easier to hook
  in from the case where we are interrupted asynchronously.

- For the debug support, I moved most of the functionality into
  kernel/isolation.c and out of kernel/sched/core.c, leaving only a
  small hook to handle mapping a remote cpu to a task struct safely.
  In addition to implementing Andy's suggestion of signalling a task
  when it is interrupted asynchronously, I also added a ratelimit
  hook so we won't spam the console if (for example) a timer interrupt
  runs amok - particularly since when this happens without ratelimit,
  it can end up self-perpetuating the timer interrupt.

- I added a task_isolation_debug_cpumask() helper function to check
  all the cpus in a mask to see if they are being interrupted
  inappropriately.

- I made the check for irq_enter() robust to architectures that
  have already entered user-mode context tracking before calling
  irq_enter(), by testing user_mode(get_irq_regs()) instead of
  context_tracking_in_user(), and split the code out into a
  separate inlined function so I could comment it better (see the
  sketch after this list).

- For arm64, I added a task_isolation_debug_cpumask() hook for
  smp_cross_call(), which I had missed in the earlier versions.

- I generalized the fix for tile to set up a clockevents hook for
  set_state_oneshot_stopped() to also apply to the arm_arch_timer,
  which I realized was showing the same problem.  For both cases,
  this seems to be what Viresh had in mind with commit 8fff52fd509345
  ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

- For tile, I adopted the arm model of doing user_exit() calls in the
  early assembly code (a new patch in this series).  I also added a
  missing task_isolation_debug hook for tile's IPI and remote cache
  flush code.
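
To make the irq_enter() change above concrete, here is a rough sketch
of the shape of that check (illustrative only: the helper name is
hypothetical, and the real patch feeds into the task_isolation_debug
reporting and ratelimiting rather than this bare call):

	/*
	 * Sketch: called from irq_enter().  Testing the saved pt_regs
	 * works even on architectures that have already switched
	 * context tracking to "kernel" before calling irq_enter().
	 */
	static inline void task_isolation_debug_irq_enter(void)
	{
		struct pt_regs *regs = get_irq_regs();

		if (regs && user_mode(regs))
			task_isolation_debug(smp_processor_id());
	}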

Chris Metcalf (12):
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  arch/x86: enable task isolation functionality
  arch/arm64: adopt prepare_exit_to_usermode() model from x86
  arch/arm64: enable task isolation functionality
  arch/tile: adopt prepare_exit_to_usermode() model from x86
  arch/tile: move user_exit() to early kernel entry sequence
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state

Christoph Lameter (1):
  vmstat: provide a function to quiet down the diff processing

 Documentation/kernel-parameters.txt  |  16 +++
 arch/arm64/include/asm/thread_info.h |  18 ++-
 arch/arm64/kernel/entry.S            |   6 +-
 arch/arm64/kernel/ptrace.c           |  12 +-
 arch/arm64/kernel/signal.c           |  35 ++++--
 arch/arm64/kernel/smp.c              |   2 +
 arch/arm64/mm/fault.c                |   4 +
 arch/tile/include/asm/processor.h    |   2 +-
 arch/tile/include/asm/thread_info.h  |   8 +-
 arch/tile/kernel/intvec_32.S         |  51 +++-----
 arch/tile/kernel/intvec_64.S         |  54 +++------
 arch/tile/kernel/process.c           |  83 +++++++------
 arch/tile/kernel/ptrace.c            |  19 +--
 arch/tile/kernel/single_step.c       |   8 +-
 arch/tile/kernel/smp.c               |  26 ++--
 arch/tile/kernel/time.c              |   1 +
 arch/tile/kernel/traps.c             |  13 +-
 arch/tile/kernel/unaligned.c         |  16 ++-
 arch/tile/mm/fault.c                 |   6 +-
 arch/tile/mm/homecache.c             |   2 +
 arch/x86/entry/common.c              |  10 +-
 arch/x86/kernel/traps.c              |   2 +
 arch/x86/mm/fault.c                  |   2 +
 drivers/clocksource/arm_arch_timer.c |   2 +
 include/linux/isolation.h            |  80 +++++++++++++
 include/linux/sched.h                |   3 +
 include/linux/swap.h                 |   1 +
 include/linux/vmstat.h               |   4 +
 include/uapi/linux/prctl.h           |   8 ++
 init/Kconfig                         |  20 ++++
 kernel/Makefile                      |   1 +
 kernel/irq_work.c                    |   5 +-
 kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                  |  18 +++
 kernel/signal.c                      |   5 +
 kernel/smp.c                         |   6 +-
 kernel/softirq.c                     |  33 +++++
 kernel/sys.c                         |   9 ++
 mm/swap.c                            |  13 +-
 mm/vmstat.c                          |  24 ++++
 40 files changed, 665 insertions(+), 188 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.1.2


* [PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

From: Christoph Lameter <cl@linux.com>

quiet_vmstat() can be called in anticipation of an OS "quiet" period
where no tick processing should be triggered. quiet_vmstat() will fold
all pending differentials into the global counters and disable the
vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
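
For context, the task-isolation code later in this series calls this
on the return-to-userspace path; from _task_isolation_enter() in the
"add initial support" patch:

	/* Quieten the vmstat worker so it won't interrupt us. */
	quiet_vmstat();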

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 5dbc8b0ee567..6f5a21993ff3 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -189,6 +189,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -249,6 +250,7 @@ static inline void __dec_zone_page_state(struct page *page,
 
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0d5712b0206c..0510d2ec31a6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1418,6 +1418,20 @@ static void vmstat_update(struct work_struct *w)
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
-- 
2.1.2



* [PATCH v9 02/13] vmstat: add vmstat_idle function
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This function checks that the vmstat worker is not running on the
current core and that the vmstat diffs don't require an update.  It
is called from the task-isolation code to determine whether any
work is actually needed to quiet vmstat.
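
Concretely, the task-isolation core in this series uses it as one of
the readiness checks before returning to userspace; from
_task_isolation_ready():

	/* If vmstats need updating, we're not ready. */
	if (!vmstat_idle())
		return false;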

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6f5a21993ff3..3dc82bf5bce6 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,6 +190,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void quiet_vmstat(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -251,6 +252,7 @@ static inline void __dec_zone_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0510d2ec31a6..ccc390197464 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1454,6 +1454,16 @@ static bool need_update(int cpu)
 	return false;
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	int cpu = smp_processor_id();
+	return cpumask_test_cpu(cpu, cpu_stat_off) && !need_update(cpu);
+}
+
 
 /*
  * Shepherd worker thread that checks the
-- 
2.1.2



* [PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was previously done inline in the loop in
lru_add_drain_all(); factoring it out so it can be called for a
particular cpu is helpful for the task-isolation patches.
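
Besides replacing the open-coded test in lru_add_drain_all() below,
the task-isolation core calls it for the current cpu; from
_task_isolation_ready():

	/* If we need to drain the LRU cache, we're not ready. */
	if (lru_add_drain_needed(smp_processor_id()))
		return false;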

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 13 +++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..66719610c9f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,6 +305,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 39395fb549c0..ce1eb052a293 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,6 +854,14 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -880,10 +888,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.1.2



* [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
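
As a usage illustration, a minimal userspace sketch (hypothetical
code, not part of this series; it assumes the new constants are
visible via <sys/prctl.h>, needs _GNU_SOURCE and <sched.h> for the
affinity calls, and reflects the requirement that the task first be
affinitized to a single task_isolation cpu):

	/*
	 * Pin to one isolated cpu (here cpu 3, assuming boot with
	 * task_isolation=3), then request task isolation.
	 */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		err(1, "sched_setaffinity");
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
		err(1, "prctl");	/* -EINVAL if placement is wrong */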

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flags, to the value
passed by prctl().  When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future. 

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify
the scheduler to attempt to schedule a different task.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).
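
Schematically, an architecture's exit path then looks something like
this (a sketch only; ti_work_pending() stands in for the arch's usual
TIF work-flags test, and the per-arch patches later in the series
wire this into the real prepare_exit_to_usermode() equivalents):

	/* Return-to-userspace work loop, irqs disabled at the test. */
	while (ti_work_pending() || !task_isolation_ready()) {
		local_irq_enable();
		/* ... handle TIF_NEED_RESCHED, TIF_SIGPENDING, etc. ... */
		task_isolation_enter();	/* quiesce what we can */
		local_irq_disable();
	}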

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |   8 +++
 include/linux/isolation.h           |  50 +++++++++++++++++
 include/linux/sched.h               |   3 ++
 include/uapi/linux/prctl.h          |   5 ++
 init/Kconfig                        |  20 +++++++
 kernel/Makefile                     |   1 +
 kernel/isolation.c                  | 105 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   9 ++++
 8 files changed, 201 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d18fc8..e035679e646e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3665,6 +3665,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y, specify
+			the list of CPUs on which processes will be able
+			to use prctl(PR_SET_TASK_ISOLATION) to set up task
+			isolation mode.  Setting this boot flag implicitly
+			also sets up nohz_full and isolcpus mode for the
+			listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..ed1bfc793c5a
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,50 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return tick_nohz_full_enabled() &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+static inline bool task_isolation_enabled(void)
+{
+	return task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern bool _task_isolation_ready(void);
+extern void _task_isolation_enter(void);
+
+static inline bool task_isolation_ready(void)
+{
+	return !task_isolation_enabled() || _task_isolation_ready();
+}
+
+static inline void task_isolation_enter(void)
+{
+	if (task_isolation_enabled())
+		_task_isolation_enter();
+}
+
+#else
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_enabled(void) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a43edea..d439ee4f2ce2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1812,6 +1812,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..67224df4b559 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2c0d20..fb0c707e527f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for tasks running hard real-time code in userspace,
+	 such as a 10 Gbit network driver.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf008ecb3..693a2ba35679 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..68a9f7457bc0
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,105 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
+	cpumask_copy(cpu_isolated_map, task_isolation_map);
+
+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
+	tick_nohz_full_running = true;
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+	    !task_isolation_possible(smp_processor_id()))
+		return -EINVAL;
+
+	current->task_isolation_flags = flags;
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  To handle
+ * the periodic scheduler tick, we test to make sure that the tick is
+ * stopped, and if it isn't yet, we request a reschedule so that if
+ * another task needs to run to completion first, it can do so.
+ * Similarly, if any other subsystems require quiescing, we will need
+ * to do that before we return to userspace.
+ */
+bool _task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	/* If we need to drain the LRU cache, we're not ready. */
+	if (lru_add_drain_needed(smp_processor_id()))
+		return false;
+
+	/* If vmstats need updating, we're not ready. */
+	if (!vmstat_idle())
+		return false;
+
+	/* Request rescheduling unless we are in full dynticks mode. */
+	if (!tick_nohz_tick_stopped()) {
+		set_tsk_need_resched(current);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void _task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 6af9212ab5aa..7c97227dfb39 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2



* [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode
  2016-01-04 19:34 ` Chris Metcalf
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.
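
For illustration, a hypothetical userspace sketch (assuming the new
constants are visible via <sys/prctl.h>): request SIGUSR1 rather than
the default SIGKILL, and re-arm isolation from the handler, which
works because the kernel clears the flags before signalling:

	static void isolation_lost(int sig)
	{
		/* Our isolation flags were cleared before delivery;
		 * re-enable them if we intend to keep going. */
		prctl(PR_SET_TASK_ISOLATION,
		      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
		      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
	}

	/* In main(), after affinitizing to an isolated cpu: */
	signal(SIGUSR1, isolation_lost);
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1));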

To allow the state to be entered and exited, we ignore the prctl()
syscall so that the flag can be cleared again later, and we ignore
exit/exit_group so that a task can exit without a pointless signal
killing it on the way out.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/isolation.h  | 25 +++++++++++++++++++
 include/uapi/linux/prctl.h |  3 +++
 kernel/isolation.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ed1bfc793c5a..69a3e4c59ab3 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -40,11 +40,36 @@ static inline void task_isolation_enter(void)
 		_task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+	return (task_isolation_possible(smp_processor_id()) &&
+		(current->task_isolation_flags &
+		 (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+		(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT));
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+	return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...)			\
+	do {								\
+		if (task_isolation_strict())				\
+			task_isolation_exception(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
 #else
 static inline bool task_isolation_possible(int cpu) { return false; }
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION		48
 #define PR_GET_TASK_ISOLATION		49
 # define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_STRICT	(1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 68a9f7457bc0..29ffb21ada0b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <asm/unistd.h>
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -103,3 +104,62 @@ void _task_isolation_enter(void)
 	/* Quieten the vmstat worker so it won't interrupt us. */
 	quiet_vmstat();
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+	siginfo_t info = {};
+	int sig;
+
+	pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+		task->comm, task->pid, buf);
+
+	sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+	if (sig == 0)
+		sig = SIGKILL;
+
+	/*
+	 * Now that the requested signal has been read, turn off task
+	 * isolation mode entirely to avoid spamming the process with
+	 * signals.  It can re-enable it in the signal handler.
+	 */
+	task->task_isolation_flags = 0;
+	info.si_signo = sig;
+	send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_interrupt(current, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+bool task_isolation_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return false;
+	}
+
+	task_isolation_exception("syscall %d", syscall);
+	return true;
+}
-- 
2.1.2



* [PATCH v9 06/13] task_isolation: add debug boot flag
@ 2016-01-04 19:34 ` Chris Metcalf
  2016-01-04 22:52   ` Steven Rostedt
  -1 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc, linux-kernel
  Cc: Chris Metcalf

The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should receive no
interrupts from the kernel; when this boot flag is specified, any
interrupt that does arrive generates a kernel stack dump on the
console.

ftrace can already detect that a task_isolation core has unexpectedly
entered the kernel, but this boot flag allows the kernel to provide
better diagnostics, e.g. by reporting in the IPI-generating code
which remote core and context is preparing to deliver an interrupt
to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.
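
For example (the cpu list here is purely illustrative), booting with

	task_isolation=1-3 task_isolation_debug

will cause any interrupt the kernel is about to deliver to cpus 1-3,
while they are running a task with PR_TASK_ISOLATION_ENABLE set, to
produce a rate-limited backtrace of the interrupting context on the
console.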

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 Documentation/kernel-parameters.txt |  8 +++++
 include/linux/isolation.h           |  5 ++++
 kernel/irq_work.c                   |  5 +++-
 kernel/isolation.c                  | 60 +++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c                 | 18 +++++++++++
 kernel/signal.c                     |  5 ++++
 kernel/smp.c                        |  6 +++-
 kernel/softirq.c                    | 33 ++++++++++++++++++++
 8 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e035679e646e..112fba1727f4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3673,6 +3673,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			also sets up nohz_full and isolcpus mode for the
 			listed set of cpus.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION
+			and booted in task_isolation= mode, this
+			setting will generate console backtraces when
+			the kernel is about to interrupt a task that
+			has requested PR_TASK_ISOLATION_ENABLE and is
+			running on a task_isolation core.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 69a3e4c59ab3..3e15e75d078f 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -43,6 +43,9 @@ static inline void task_isolation_enter(void)
 extern bool task_isolation_syscall(int nr);
 extern void task_isolation_exception(const char *fmt, ...);
 extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 
 static inline bool task_isolation_strict(void)
 {
@@ -70,6 +73,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 static inline bool task_isolation_check_syscall(int nr) { return false; }
 static inline void task_isolation_check_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/isolation.h>
 #include <asm/processor.h>
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return true;
 }
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 29ffb21ada0b..9f31c0b458ed 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include <linux/vmstat.h>
 #include <linux/isolation.h>
 #include <linux/syscalls.h>
+#include <linux/ratelimit.h>
 #include <asm/unistd.h>
 #include "time/tick-sched.h"
 
@@ -163,3 +164,62 @@ bool task_isolation_syscall(int syscall)
 	task_isolation_exception("syscall %d", syscall);
 	return true;
 }
+
+/* Enable debugging of any interrupts of task_isolation cores. */
+static bool task_isolation_debug_flag;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug_flag = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+void task_isolation_debug_task(int cpu, struct task_struct *p)
+{
+	static DEFINE_RATELIMIT_STATE(console_output, HZ, 1);
+	bool force_debug = false;
+
+	/*
+	 * Our caller made sure the task was running on a task isolation
+	 * core, but make sure the task has enabled isolation.
+	 */
+	if (!(p->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return;
+
+	/*
+	 * If the task was in strict mode, deliver a signal to it.
+	 * We disable task isolation mode when we deliver a signal
+	 * so we won't end up recursing back here again.
+	 * If we are in an NMI, we don't try delivering the signal
+	 * and instead just treat it as if "debug" mode was enabled,
+	 * since that's pretty much all we can do.
+	 */
+	if (p->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+		if (in_nmi())
+			force_debug = true;
+		else
+			task_isolation_interrupt(p, "interrupt");
+	}
+
+	/*
+	 * If (for example) the timer interrupt starts ticking
+	 * unexpectedly, we will get an unmanageable flow of output,
+	 * so limit to one backtrace per second.
+	 */
+	if (force_debug ||
+	    (task_isolation_debug_flag && __ratelimit(&console_output))) {
+		pr_err("Interrupt detected for task_isolation cpu %d, %s/%d\n",
+		       cpu, p->comm, p->pid);
+		dump_stack();
+	}
+}
+
+void task_isolation_debug_cpumask(const struct cpumask *mask)
+{
+	int cpu, thiscpu = smp_processor_id();
+
+	/* No need to report on this cpu since we're already in the kernel. */
+	for_each_cpu(cpu, mask)
+		if (cpu != thiscpu)
+			task_isolation_debug(cpu);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 732e993b564b..700120221f6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -746,6 +747,23 @@ bool sched_can_stop_tick(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+	struct task_struct *p;
+
+	if (!task_isolation_possible(cpu))
+		return;
+
+	rcu_read_lock();
+	p = cpu_curr(cpu);
+	get_task_struct(p);
+	rcu_read_unlock();
+	task_isolation_debug_task(cpu, p);
+	put_task_struct(p);
+}
+#endif
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a972fd..c45ef71f329c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,6 +638,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+#ifdef CONFIG_TASK_ISOLATION
+	/* If the task is being killed, don't complain about task_isolation. */
+	if (state & TASK_WAKEKILL)
+		t->task_isolation_flags = 0;
+#endif
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/smp.c b/kernel/smp.c
index d903c02223af..a61894409645 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/isolation.h>
 
 #include "smpboot.h"
 
@@ -178,8 +179,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	 * locking and barrier primitives. Generic code isn't really
 	 * equipped to do the right thing...
 	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+		task_isolation_debug(cpu);
 		arch_send_call_function_single_ipi(cpu);
+	}
 
 	return 0;
 }
@@ -457,6 +460,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	}
 
 	/* Send a message to all CPUs in the map */
+	task_isolation_debug_cpumask(cfd->cpumask);
 	arch_send_call_function_ipi_mask(cfd->cpumask);
 
 	if (wait) {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..f249b71cddf4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -319,6 +320,37 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/* Determine whether this IRQ is something task isolation cares about. */
+static void task_isolation_irq(void)
+{
+#ifdef CONFIG_TASK_ISOLATION
+	struct pt_regs *regs;
+
+	if (!context_tracking_cpu_is_enabled())
+		return;
+
+	/*
+	 * We have not yet called __irq_enter() and so we haven't
+	 * adjusted the hardirq count.  This test will allow us to
+	 * avoid false positives for nested IRQs.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
+	 * If we were already in the kernel, not from an irq but from
+	 * a syscall or synchronous exception/fault, this test should
+	 * avoid a false positive as well.  Note that this requires
+	 * architecture support for calling set_irq_regs() prior to
+	 * calling irq_enter(), and if it's not done consistently, we
+	 * will not consistently avoid false positives here.
+	 */
+	regs = get_irq_regs();
+	if (regs && user_mode(regs))
+		task_isolation_debug(smp_processor_id());
+#endif
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -335,6 +367,7 @@ void irq_enter(void)
 		_local_bh_enable();
 	}
 
+	task_isolation_irq();
 	__irq_enter();
 }
 
-- 
2.1.2



* [PATCH v9 07/13] arch/x86: enable task isolation functionality
@ 2016-01-04 19:34 ` Chris Metcalf
  2016-01-04 21:02   ` [PATCH v9bis " Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls; on a violation we set
regs->orig_ax to -1 so that the already-signalled task does not
actually execute the syscall.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/x86/entry/common.c | 10 +++++++++-
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..75958a6b5112 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
+		if (task_isolation_check_syscall(regs->orig_ax)) {
+			regs->orig_ax = -1;
+			return 0;
+		}
 		work &= ~_TIF_NOHZ;
 	}
 #endif
@@ -254,12 +259,15 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+		    task_isolation_ready())
 			break;
 
 	}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_check_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_check_exception */
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_check_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.1.2



* [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
arm64 do_notify_resume() is called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/kernel/entry.S  |  6 +++---
 arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ed3d75f6304..04eff4c4ac6e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -630,9 +630,8 @@ work_pending:
 	mov	x0, sp				// 'regs'
 	tst	x2, #PSR_MODE_MASK		// user mode regs?
 	b.ne	no_work_pending			// returning to kernel
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
+	bl	prepare_exit_to_usermode
+	b	no_user_work_pending
 work_resched:
 	bl	schedule
 
@@ -644,6 +643,7 @@ ret_to_user:
 	ldr	x1, [tsk, #TI_FLAGS]
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
+no_user_work_pending:
 	enable_step_tsk x1, x2
 no_work_pending:
 	kernel_exit 0
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48cb6db1..fde59c1139a9 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
+					 unsigned int thread_flags)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	do {
+		local_irq_enable();
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+		if (thread_flags & _TIF_NEED_RESCHED)
+			schedule();
+
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+
+		local_irq_disable();
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		thread_flags = READ_ONCE(current_thread_info()->flags) &
+			_TIF_WORK_MASK;
 
+	} while (thread_flags);
 }
-- 
2.1.2



* [PATCH v9 09/13] arch/arm64: enable task isolation functionality
@ 2016-01-04 19:34   ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel
  Cc: Chris Metcalf

We need to call task_isolation_enter() from prepare_exit_to_usermode(),
so that we both ensure it runs last before returning to userspace,
and are able to re-run signal handling, etc., if something occurs
while task_isolation_enter() has interrupts enabled.  To do this we
add _TIF_NOHZ to the _TIF_WORK_MASK if CONFIG_TASK_ISOLATION is
enabled, which brings us into prepare_exit_to_usermode() on every
return to userspace.  But we don't put _TIF_NOHZ in the flags that
we use to loop back and recheck, since the flag being set is not by
itself a reason to loop back.  Instead we unconditionally call
task_isolation_enter() at the end of the loop whenever any other
work is done.

To make the assembly code continue to be as optimized as before,
we renumber the _TIF flags so that both _TIF_WORK_MASK and
_TIF_SYSCALL_WORK still have contiguous runs of bits in the
immediate operand for the "and" instruction, as required by the
ARM64 ISA.  Since TIF_NOHZ is in both masks, it must be the
middle bit in the contiguous run that starts with the
_TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits.

We tweak syscall_trace_enter() slightly to read the "flags" value
from current_thread_info()->flags once and reuse it for each of the
tests, rather than doing a volatile read from memory for each one.  This
avoids a small overhead for each test, and in particular avoids
that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_cross_call() routine so that it checks for
isolated tasks and generates a suitable warning if we are about
to disturb one of them in strict or debug mode.

Finally, add an explicit check for STRICT mode in do_mem_abort()
to handle the case of page faults.
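
As an illustration of the resulting layout (a sketch based on the
renumbering below, with TIF_SIGPENDING at bit 0):

	_TIF_WORK_MASK     -> bits 0..4  (TIF_SIGPENDING .. TIF_NOHZ)
	_TIF_SYSCALL_WORK  -> bits 4..8  (TIF_NOHZ .. TIF_SECCOMP)

Each mask is thus a single contiguous run of set bits, sharing
TIF_NOHZ (bit 4) as the middle bit, so both can still be encoded
as logical immediates for the "and" instruction.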

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------
 arch/arm64/kernel/ptrace.c           | 12 +++++++++---
 arch/arm64/kernel/signal.c           |  7 +++++--
 arch/arm64/kernel/smp.c              |  2 ++
 arch/arm64/mm/fault.c                |  4 ++++
 5 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 90c7ff233735..94a98e9e29ef 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -103,11 +103,11 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED	1
 #define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE	3	/* CPU's FP state is not current's */
-#define TIF_NOHZ		7
-#define TIF_SYSCALL_TRACE	8
-#define TIF_SYSCALL_AUDIT	9
-#define TIF_SYSCALL_TRACEPOINT	10
-#define TIF_SECCOMP		11
+#define TIF_NOHZ		4
+#define TIF_SYSCALL_TRACE	5
+#define TIF_SYSCALL_AUDIT	6
+#define TIF_SYSCALL_TRACEPOINT	7
+#define TIF_SECCOMP		8
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -125,9 +125,15 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
 
-#define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
+#define _TIF_WORK_LOOP_MASK	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE)
 
+#ifdef CONFIG_TASK_ISOLATION
+# define _TIF_WORK_MASK		(_TIF_WORK_LOOP_MASK | _TIF_NOHZ)
+#else
+# define _TIF_WORK_MASK		_TIF_WORK_LOOP_MASK
+#endif
+
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
 				 _TIF_NOHZ)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1971f491bb90..69ed3ba81650 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -37,6 +37,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/debug-monitors.h>
@@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
-	/* Do the secure computing check first; failures should be fast. */
+	unsigned long work = ACCESS_ONCE(current_thread_info()->flags);
+
+	if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno))
+		return -1;
+
+	/* Do the secure computing check early; failures should be fast. */
 	if (secure_computing() == -1)
 		return -1;
 
-	if (test_thread_flag(TIF_SYSCALL_TRACE))
+	if (work & _TIF_SYSCALL_TRACE)
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
 
-	if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+	if (work & _TIF_SYSCALL_TRACEPOINT)
 		trace_sys_enter(regs, regs->syscallno);
 
 	audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index fde59c1139a9..641c828653c7 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
+#include <linux/isolation.h>
 
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
@@ -419,10 +420,12 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
 		if (thread_flags & _TIF_FOREIGN_FPSTATE)
 			fpsimd_restore_current_state();
 
+		task_isolation_enter();
+
 		local_irq_disable();
 
 		thread_flags = READ_ONCE(current_thread_info()->flags) &
-			_TIF_WORK_MASK;
+			_TIF_WORK_LOOP_MASK;
 
-	} while (thread_flags);
+	} while (thread_flags || !task_isolation_ready());
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51b2c2e..dcb3282d04a2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
 #include <linux/completion.h>
 #include <linux/of.h>
 #include <linux/irq_work.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -632,6 +633,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
 {
 	trace_ipi_raise(target, ipi_types[ipinr]);
+	task_isolation_debug_cpumask(target);
 	__smp_cross_call(target, ipinr);
 }
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1e8ca2..fbc78035b2af 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/exception.h>
@@ -466,6 +467,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
 	const struct fault_info *inf = fault_info + (esr & 63);
 	struct siginfo info;
 
+	if (user_mode(regs))
+		task_isolation_check_exception("%s at %#lx", inf->name, addr);
+
 	if (!inf->fn(addr, esr, regs))
 		return;
 
-- 
2.1.2



* [PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This change is a prerequisite change for TASK_ISOLATION but also
stands on its own for readability and maintainability.  The existing
tile do_work_pending() was called in a loop from assembly on
the slow path; this change moves the loop into C code as well.
For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new,
comprehensible entry and exit handlers written in C").

This change exposes a pre-existing bug on the older tilepro platform;
the singlestep processing is done last, but on tilepro (unlike tilegx)
we enable interrupts while doing that processing, so we could in
theory miss a signal or other asynchronous event.  A future change
could fix this by breaking the singlestep work into a "prepare"
step done in the main loop, and a "trigger" step done after exiting
the loop.  Since this change is intended as purely a restructuring
change, we call out the bug explicitly now, but don't yet fix it.
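
One possible shape for that future fix, purely as a sketch
(single_step_prepare() and single_step_trigger() are hypothetical
helpers, not existing tile functions):

	do {
		local_irq_enable();
		/* ... handle resched/signals/notify-resume as below ... */
		if (thread_info_flags & _TIF_SINGLESTEP)
			single_step_prepare(regs);	/* hypothetical */
		local_irq_disable();
		thread_info_flags = READ_ONCE(current_thread_info()->flags);
	} while (thread_info_flags & _TIF_WORK_MASK);

	if (thread_info_flags & _TIF_SINGLESTEP)
		single_step_trigger(regs);	/* hypothetical: runs with irqs disabled */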

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/include/asm/processor.h   |  2 +-
 arch/tile/include/asm/thread_info.h |  8 +++-
 arch/tile/kernel/intvec_32.S        | 46 +++++++--------------
 arch/tile/kernel/intvec_64.S        | 49 +++++++----------------
 arch/tile/kernel/process.c          | 79 +++++++++++++++++++------------------
 5 files changed, 77 insertions(+), 107 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index 139dfdee0134..0684e88aacd8 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task)
 	/* Nothing for now */
 }
 
-extern int do_work_pending(struct pt_regs *regs, u32 flags);
+extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags);
 
 
 /*
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index dc1fb28d9636..4b7cef9e94e0 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -140,10 +140,14 @@ extern void _cpu_idle(void);
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
 
+/* Work to do as we loop to exit to user space. */
+#define _TIF_WORK_MASK \
+	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
+	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME)
+
 /* Work to do on any return to user space. */
 #define _TIF_ALLWORK_MASK \
-	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \
-	 _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ)
+	(_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ)
 
 /* Work to do at syscall entry. */
 #define _TIF_SYSCALL_ENTRY_WORK \
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index fbbe2ea882ea..33d48812872a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 * Get base of stack in r32.
-	 */
-	{
-	 GET_THREAD_INFO(r32)
-	 movei  r33, 0
-	}
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	GET_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, lo16(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, lo16(_TIF_ALLWORK_MASK)
 	}
 	{
-	 lw     r29, r29
-	 auli   r1, r1, ha16(_TIF_ALLWORK_MASK)
+	 lw     r22, r22
+	 auli   r20, r20, ha16(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	bzt     r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling, notify-resume, or single-step.  Call out to C
-	 * code to figure out exactly what we need to do for each flag bit,
-	 * then if necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnz    r33, 1f
+	 bzt    r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnz     r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 58964d209d4d..a41c994ce237 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return)
 	FEEDBACK_REENTER(interrupt_return)
 
 	/*
-	 * Use r33 to hold whether we have already loaded the callee-saves
-	 * into ptregs.  We don't want to do it twice in this loop, since
-	 * then we'd clobber whatever changes are made by ptrace, etc.
-	 */
-	{
-	 movei  r33, 0
-	 move   r32, sp
-	}
-
-	/* Get base of stack in r32. */
-	EXTRACT_THREAD_INFO(r32)
-
-.Lretry_work_pending:
-	/*
 	 * Disable interrupts so as to make sure we don't
 	 * miss an interrupt that sets any of the thread flags (like
 	 * need_resched or sigpending) between sampling and the iret.
@@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return)
 	IRQ_DISABLE(r20, r21)
 	TRACE_IRQS_OFF  /* Note: clobbers registers r0-r29 */
 
-
-	/* Check to see if there is any work to do before returning to user. */
+	/*
+	 * See if there are any work items (including single-shot items)
+	 * to do.  If so, save the callee-save registers to pt_regs
+	 * and then dispatch to C code.
+	 */
+	move    r21, sp
+	EXTRACT_THREAD_INFO(r21)
 	{
-	 addi   r29, r32, THREAD_INFO_FLAGS_OFFSET
-	 moveli r1, hw1_last(_TIF_ALLWORK_MASK)
+	 addi   r22, r21, THREAD_INFO_FLAGS_OFFSET
+	 moveli r20, hw1_last(_TIF_ALLWORK_MASK)
 	}
 	{
-	 ld     r29, r29
-	 shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK)
+	 ld     r22, r22
+	 shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK)
 	}
-	and     r1, r29, r1
-	beqzt   r1, .Lrestore_all
-
-	/*
-	 * Make sure we have all the registers saved for signal
-	 * handling or notify-resume.  Call out to C code to figure out
-	 * exactly what we need to do for each flag bit, then if
-	 * necessary, reload the flags and recheck.
-	 */
+	and     r1, r22, r20
 	{
 	 PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
-	 bnez   r33, 1f
+	 beqzt  r1, .Lrestore_all
 	}
 	push_extra_callee_saves r0
-	movei   r33, 1
-1:	jal     do_work_pending
-	bnez    r0, .Lretry_work_pending
+	jal     prepare_exit_to_usermode
 
 	/*
 	 * In the NMI case we
@@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread)
 	FEEDBACK_REENTER(ret_from_kernel_thread)
 	{
 	 movei  r30, 0               /* not an NMI */
-	 j      .Lresume_userspace   /* jump into middle of interrupt_return */
+	 j      interrupt_return
 	}
 	STD_ENDPROC(ret_from_kernel_thread)
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 7d5769310bef..b5f30d376ce1 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
 
 /*
  * This routine is called on return from interrupt if any of the
- * TIF_WORK_MASK flags are set in thread_info->flags.  It is
- * entered with interrupts disabled so we don't miss an event
- * that modified the thread_info flags.  If any flag is set, we
- * handle it and return, and the calling assembly code will
- * re-disable interrupts, reload the thread flags, and call back
- * if more flags need to be handled.
- *
- * We return whether we need to check the thread_info flags again
- * or not.  Note that we don't clear TIF_SINGLESTEP here, so it's
- * important that it be tested last, and then claim that we don't
- * need to recheck the flags.
+ * TIF_ALLWORK_MASK flags are set in thread_info->flags.  It is
+ * entered with interrupts disabled so we don't miss an event that
+ * modified the thread_info flags.  We loop until all the tested flags
+ * are clear.  Note that the function is also called for certain flags
+ * that are not listed in the loop condition here (e.g. SINGLESTEP),
+ * which guarantees that work is done once after the loop, and is
+ * redone if any of the looping work items fires again, but never
+ * itself causes the loop to continue.
  */
-int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
+void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 {
-	/* If we enter in kernel mode, do nothing and exit the caller loop. */
-	if (!user_mode(regs))
-		return 0;
+	if (WARN_ON(!user_mode(regs)))
+		return;
 
-	user_exit();
+	do {
+		local_irq_enable();
 
-	/* Enable interrupts; they are disabled again on return to caller. */
-	local_irq_enable();
+		if (thread_info_flags & _TIF_NEED_RESCHED)
+			schedule();
 
-	if (thread_info_flags & _TIF_NEED_RESCHED) {
-		schedule();
-		return 1;
-	}
 #if CHIP_HAS_TILE_DMA()
-	if (thread_info_flags & _TIF_ASYNC_TLB) {
-		do_async_page_fault(regs);
-		return 1;
-	}
+		if (thread_info_flags & _TIF_ASYNC_TLB)
+			do_async_page_fault(regs);
 #endif
-	if (thread_info_flags & _TIF_SIGPENDING) {
-		do_signal(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-		return 1;
-	}
-	if (thread_info_flags & _TIF_SINGLESTEP)
+
+		if (thread_info_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		local_irq_disable();
+		thread_info_flags = READ_ONCE(current_thread_info()->flags);
+
+	} while (thread_info_flags & _TIF_WORK_MASK);
+
+	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
+#ifndef __tilegx__
+		/*
+		 * FIXME: on tilepro, since we enable interrupts in
+		 * this routine, it's possible that we miss a signal
+		 * or other asynchronous event.
+		 */
+		local_irq_disable();
+#endif
+	}
 
 	user_enter();
-
-	return 0;
 }
 
 unsigned long get_wchan(struct task_struct *p)
-- 
2.1.2



* [PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence
@ 2016-01-04 19:34 ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

This ensures that we always notify context tracking that we
have exited from user space no matter how we enter the kernel.
It is similar to how arm64 handles context tracking, for example.

This allows the removal of all the exception_enter() calls that
were added in commit 49e4e15619cd ("tile: support CONTEXT_TRACKING and
thus NOHZ_FULL").

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/intvec_32.S   |  5 ++++-
 arch/tile/kernel/intvec_64.S   |  5 ++++-
 arch/tile/kernel/ptrace.c      | 15 ---------------
 arch/tile/kernel/single_step.c |  3 ---
 arch/tile/kernel/traps.c       | 13 ++++---------
 arch/tile/kernel/unaligned.c   | 13 ++++---------
 arch/tile/mm/fault.c           |  3 ---
 7 files changed, 16 insertions(+), 41 deletions(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 33d48812872a..9ff75e3a318a 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -572,7 +572,7 @@ intvec_\vecname:
 	}
 	wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
 	.ifnc \function,handle_nmi
 	/*
 	 * We finally have enough state set up to notify the irq
@@ -588,6 +588,9 @@ intvec_\vecname:
 	{ move r32, r2; move r33, r3 }
 	.endif
 	TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+	jal     context_tracking_user_exit
+#endif
 	.ifnc \function,handle_syscall
 	{ move r0, r30; move r1, r31 }
 	{ move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index a41c994ce237..f080a6c3d82b 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -753,7 +753,7 @@ intvec_\vecname:
 	}
 	wh64    r52
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_CONTEXT_TRACKING)
 	.ifnc \function,handle_nmi
 	/*
 	 * We finally have enough state set up to notify the irq
@@ -769,6 +769,9 @@ intvec_\vecname:
 	{ move r32, r2; move r33, r3 }
 	.endif
 	TRACE_IRQS_OFF
+#ifdef CONFIG_CONTEXT_TRACKING
+	jal     context_tracking_user_exit
+#endif
 	.ifnc \function,handle_syscall
 	{ move r0, r30; move r1, r31 }
 	{ move r2, r32; move r3, r33 }
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index bdc126faf741..54e7b723db99 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -255,13 +255,6 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
-	/*
-	 * If TIF_NOHZ is set, we are required to call user_exit() before
-	 * doing anything that could touch RCU.
-	 */
-	if (work & _TIF_NOHZ)
-		user_exit();
-
 	if (secure_computing() == -1)
 		return -1;
 
@@ -281,12 +274,6 @@ void do_syscall_trace_exit(struct pt_regs *regs)
 	long errno;
 
 	/*
-	 * We may come here right after calling schedule_user()
-	 * in which case we can be in RCU user mode.
-	 */
-	user_exit();
-
-	/*
 	 * The standard tile calling convention returns the value (or negative
 	 * errno) in r0, and zero (or positive errno) in r1.
 	 * It saves a couple of cycles on the hot path to do this work in
@@ -322,7 +309,5 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs)
 /* Handle synthetic interrupt delivered only by the simulator. */
 void __kprobes do_breakpoint(struct pt_regs* regs, int fault_num)
 {
-	enum ctx_state prev_state = exception_enter();
 	send_sigtrap(current, regs);
-	exception_exit(prev_state);
 }
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 53f7b9def07b..862973074bf9 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,7 +23,6 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -739,7 +738,6 @@ static DEFINE_PER_CPU(unsigned long, ss_saved_pc);
 
 void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
 {
-	enum ctx_state prev_state = exception_enter();
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	struct thread_info *info = (void *)current_thread_info();
 	int is_single_step = test_ti_thread_flag(info, TIF_SINGLESTEP);
@@ -756,7 +754,6 @@ void gx_singlestep_handle(struct pt_regs *regs, int fault_num)
 		__insn_mtspr(SPR_SINGLE_STEP_CONTROL_K, control);
 		send_sigtrap(current, regs);
 	}
-	exception_exit(prev_state);
 }
 
 
diff --git a/arch/tile/kernel/traps.c b/arch/tile/kernel/traps.c
index 0011a9ff0525..4d9651c5b1ad 100644
--- a/arch/tile/kernel/traps.c
+++ b/arch/tile/kernel/traps.c
@@ -20,7 +20,6 @@
 #include <linux/reboot.h>
 #include <linux/uaccess.h>
 #include <linux/ptrace.h>
-#include <linux/context_tracking.h>
 #include <asm/stack.h>
 #include <asm/traps.h>
 #include <asm/setup.h>
@@ -254,7 +253,6 @@ static int do_bpt(struct pt_regs *regs)
 void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 		       unsigned long reason)
 {
-	enum ctx_state prev_state = exception_enter();
 	siginfo_t info = { 0 };
 	int signo, code;
 	unsigned long address = 0;
@@ -263,7 +261,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 
 	/* Handle breakpoints, etc. */
 	if (is_kernel && fault_num == INT_ILL && do_bpt(regs))
-		goto done;
+		return;
 
 	/* Re-enable interrupts, if they were previously enabled. */
 	if (!(regs->flags & PT_FLAGS_DISABLE_IRQ))
@@ -277,7 +275,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 		const char *name;
 		char buf[100];
 		if (fixup_exception(regs))  /* ILL_TRANS or UNALIGN_DATA */
-			goto done;
+			return;
 		if (fault_num >= 0 &&
 		    fault_num < ARRAY_SIZE(int_name) &&
 		    int_name[fault_num] != NULL)
@@ -319,7 +317,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 	case INT_GPV:
 #if CHIP_HAS_TILE_DMA()
 		if (retry_gpv(reason))
-			goto done;
+			return;
 #endif
 		/*FALLTHROUGH*/
 	case INT_UDN_ACCESS:
@@ -346,7 +344,7 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 			if (!state ||
 			    (void __user *)(regs->pc) != state->buffer) {
 				single_step_once(regs);
-				goto done;
+				return;
 			}
 		}
 #endif
@@ -390,9 +388,6 @@ void __kprobes do_trap(struct pt_regs *regs, int fault_num,
 	if (signo != SIGTRAP)
 		trace_unhandled_signal("trap", regs, address, signo);
 	force_sig_info(signo, &info, current);
-
-done:
-	exception_exit(prev_state);
 }
 
 void do_nmi(struct pt_regs *regs, int fault_num, unsigned long reason)
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index d075f92ccee0..0db5f7c9d9e5 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,7 +25,6 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
-#include <linux/context_tracking.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1449,7 +1448,6 @@ void jit_bundle_gen(struct pt_regs *regs, tilegx_bundle_bits bundle,
 
 void do_unaligned(struct pt_regs *regs, int vecnum)
 {
-	enum ctx_state prev_state = exception_enter();
 	tilegx_bundle_bits __user  *pc;
 	tilegx_bundle_bits bundle;
 	struct thread_info *info = current_thread_info();
@@ -1503,7 +1501,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 				*((tilegx_bundle_bits *)(regs->pc)));
 			jit_bundle_gen(regs, bundle, align_ctl);
 		}
-		goto done;
+		return;
 	}
 
 	/*
@@ -1527,7 +1525,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 		trace_unhandled_signal("unaligned fixup trap", regs, 0, SIGBUS);
 		force_sig_info(info.si_signo, &info, current);
-		goto done;
+		return;
 	}
 
 
@@ -1544,7 +1542,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		trace_unhandled_signal("segfault in unalign fixup", regs,
 				       (unsigned long)info.si_addr, SIGSEGV);
 		force_sig_info(info.si_signo, &info, current);
-		goto done;
+		return;
 	}
 
 	if (!info->unalign_jit_base) {
@@ -1579,7 +1577,7 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 		if (IS_ERR((void __force *)user_page)) {
 			pr_err("Out of kernel pages trying do_mmap\n");
-			goto done;
+			return;
 		}
 
 		/* Save the address in the thread_info struct */
@@ -1592,9 +1590,6 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 
 	/* Generate unalign JIT */
 	jit_bundle_gen(regs, GX_INSN_BSWAP(bundle), align_ctl);
-
-done:
-	exception_exit(prev_state);
 }
 
 #endif /* __tilegx__ */
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..26734214818c 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,7 +35,6 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
-#include <linux/context_tracking.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -845,9 +844,7 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
-	enum ctx_state prev_state = exception_enter();
 	__do_page_fault(regs, fault_num, address, write);
-	exception_exit(prev_state);
 }
 
 #if CHIP_HAS_TILE_DMA()
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH v9 12/13] arch/tile: enable task isolation functionality
  2016-01-04 19:34 ` Chris Metcalf
                   ` (11 preceding siblings ...)
  (?)
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-kernel
  Cc: Chris Metcalf

We add the necessary call to task_isolation_enter() in the
prepare_exit_to_usermode() routine.  We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_check_exception() in places
where exceptions may not generate signals to the application.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
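For context, a minimal consumer of this functionality looks roughly
like the sketch below.  It assumes the prctl() interface added earlier
in this series (PR_SET_TASK_ISOLATION / PR_TASK_ISOLATION_ENABLE; the
fallback values below are the ones this series uses, quoted from
memory and not a stable ABI), plus a core set aside with the
task_isolation= boot argument:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdlib.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_TASK_ISOLATION		/* from this series */
	# define PR_SET_TASK_ISOLATION		48
	# define PR_TASK_ISOLATION_ENABLE	1
	#endif

	int main(void)
	{
		cpu_set_t set;

		/* Pin ourselves to an isolated nohz_full core, e.g. cpu 1. */
		CPU_ZERO(&set);
		CPU_SET(1, &set);
		if (sched_setaffinity(0, sizeof(set), &set) != 0)
			exit(1);

		/* Ask the kernel to quiesce the core (pending timers,
		 * vmstat work, etc.) before each return to user space. */
		if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
			exit(1);

		for (;;) {
			/* userspace-only fast path; no syscalls here */
		}
	}
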
 arch/tile/kernel/process.c     |  6 +++++-
 arch/tile/kernel/ptrace.c      |  6 ++++++
 arch/tile/kernel/single_step.c |  5 +++++
 arch/tile/kernel/smp.c         | 26 ++++++++++++++------------
 arch/tile/kernel/unaligned.c   |  3 +++
 arch/tile/mm/fault.c           |  3 +++
 arch/tile/mm/homecache.c       |  2 ++
 7 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..832febfd65df 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/signal.h>
 #include <linux/delay.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/stack.h>
 #include <asm/switch_to.h>
 #include <asm/homecache.h>
@@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
 			tracehook_notify_resume(regs);
 		}
 
+		task_isolation_enter();
+
 		local_irq_disable();
 		thread_info_flags = READ_ONCE(current_thread_info()->flags);
 
-	} while (thread_info_flags & _TIF_WORK_MASK);
+	} while ((thread_info_flags & _TIF_WORK_MASK) ||
+		 !task_isolation_ready());
 
 	if (thread_info_flags & _TIF_SINGLESTEP) {
 		single_step_once(regs);
diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index 54e7b723db99..f76f2d8b8923 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -23,6 +23,7 @@
 #include <linux/elf.h>
 #include <linux/tracehook.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 #include <asm/traps.h>
 #include <arch/chip.h>
 
@@ -255,6 +256,11 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 work = ACCESS_ONCE(current_thread_info()->flags);
 
+	if (work & _TIF_NOHZ) {
+		if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR]))
+			return -1;
+	}
+
 	if (secure_computing() == -1)
 		return -1;
 
diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index 862973074bf9..ba01eacde7a3 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
 #include <linux/types.h>
 #include <linux/err.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -320,6 +321,8 @@ void single_step_once(struct pt_regs *regs)
 	int size = 0, sign_ext = 0;  /* happy compiler */
 	int align_ctl;
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	align_ctl = unaligned_fixup;
 	switch (task_thread_info(current)->align_ctl) {
 	case PR_UNALIGN_NOPRINT:
@@ -767,6 +770,8 @@ void single_step_once(struct pt_regs *regs)
 	unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
 	unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);
 
+	task_isolation_check_exception("single step at %#lx", regs->pc);
+
 	*ss_pc = regs->pc;
 	control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
 	control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 07e3ff5cc740..7298d68d4584 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/irq_work.h>
 #include <linux/module.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/homecache.h>
 
@@ -181,10 +182,11 @@ void flush_icache_range(unsigned long start, unsigned long end)
 	struct ipi_flush flush = { start, end };
 
 	/* If invoked with irqs disabled, we can not issue IPIs. */
-	if (irqs_disabled())
+	if (irqs_disabled()) {
+		task_isolation_debug_cpumask(&task_isolation_map);
 		flush_remote(0, HV_FLUSH_EVICT_L1I, NULL, 0, 0, 0,
 			NULL, NULL, 0);
-	else {
+	} else {
 		preempt_disable();
 		on_each_cpu(ipi_flush_icache_range, &flush, 1);
 		preempt_enable();
@@ -258,10 +260,8 @@ void __init ipi_init(void)
 
 #if CHIP_HAS_IPI()
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	WARN_ON(cpu_is_offline(cpu));
-
 	/*
 	 * We just want to do an MMIO store.  The traditional writeq()
 	 * functions aren't really correct here, since they're always
@@ -273,15 +273,17 @@ void smp_send_reschedule(int cpu)
 
 #else
 
-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
 {
-	HV_Coord coord;
-
-	WARN_ON(cpu_is_offline(cpu));
-
-	coord.y = cpu_y(cpu);
-	coord.x = cpu_x(cpu);
+	HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
 	hv_trigger_ipi(coord, IRQ_RESCHEDULE);
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+	WARN_ON(cpu_is_offline(cpu));
+	task_isolation_debug(cpu);
+	__smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 0db5f7c9d9e5..b1e229a1ff62 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/prctl.h>
+#include <linux/isolation.h>
 #include <asm/cacheflush.h>
 #include <asm/traps.h>
 #include <asm/uaccess.h>
@@ -1545,6 +1546,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
 		return;
 	}
 
+	task_isolation_check_exception("unaligned JIT at %#lx", regs->pc);
+
 	if (!info->unalign_jit_base) {
 		void __user *user_page;
 
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..1dee18d3ffbd 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -35,6 +35,7 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
+#include <linux/isolation.h>
 
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
@@ -844,6 +845,8 @@ static inline void __do_page_fault(struct pt_regs *regs, int fault_num,
 void do_page_fault(struct pt_regs *regs, int fault_num,
 		   unsigned long address, unsigned long write)
 {
+	task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)",
+				       fault_num, regs->pc, address);
 	__do_page_fault(regs, fault_num, address, write);
 }
 
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 40ca30a9fee3..e044e8dd8372 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
 #include <linux/smp.h>
 #include <linux/module.h>
 #include <linux/hugetlb.h>
+#include <linux/isolation.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -83,6 +84,7 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
 	 * Don't bother to update atomically; losing a count
 	 * here is not that critical.
 	 */
+	task_isolation_debug_cpumask(&mask);
 	for_each_cpu(cpu, &mask)
 		++per_cpu(irq_stat, cpu).irq_hv_flush_count;
 }
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state
  2016-01-04 19:34 ` Chris Metcalf
                   ` (12 preceding siblings ...)
  (?)
@ 2016-01-04 19:34 ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 19:34 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-kernel
  Cc: Chris Metcalf

When the scheduler tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns.  See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
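For reference, the fallback described above lives in
kernel/time/clockevents.c and looks roughly like this in 4.4
(paraphrased from memory, trimmed to the relevant case):

	static int __clockevents_switch_state(struct clock_event_device *dev,
					      enum clock_event_state state)
	{
		switch (state) {
		/* ... */
		case CLOCK_EVT_STATE_ONESHOT_STOPPED:
			/*
			 * Without this patch, tile and the arm_arch_timer
			 * leave this callback NULL, so we return -ENOSYS
			 * here and never shut the hardware timer off.
			 */
			if (dev->set_state_oneshot_stopped)
				return dev->set_state_oneshot_stopped(dev);
			else
				return -ENOSYS;
		/* ... */
		}
	}
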
 arch/tile/kernel/time.c              | 1 +
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 178989e6d3e3..fbedf380d9d4 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
 	.set_next_event = tile_timer_set_next_event,
 	.set_state_shutdown = tile_timer_shutdown,
 	.set_state_oneshot = tile_timer_shutdown,
+	.set_state_oneshot_stopped = tile_timer_shutdown,
 	.tick_resume = tile_timer_shutdown,
 };
 
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index c64d543d64bf..727a669afb1f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -288,6 +288,8 @@ static void __arch_timer_setup(unsigned type,
 		}
 	}
 
+	clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
 	clk->set_state_shutdown(clk);
 
 	clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 19:34   ` Chris Metcalf
@ 2016-01-04 20:33     ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-04 20:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel

Hi,

On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> This change is a prerequisite change for TASK_ISOLATION but also
> stands on its own for readability and maintainability. 

I have also been looking into converting the userspace return path from
assembly to C [1], for the latter two reasons. Based on that, I have a
couple of comments.

> The existing arm64 do_notify_resume() is called in a loop from
> assembly on the slow path; this change moves the loop into C code as
> well.  For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add
> new, comprehensible entry and exit handlers written in C").
>
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/arm64/kernel/entry.S  |  6 +++---
>  arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++----------
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 7ed3d75f6304..04eff4c4ac6e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -630,9 +630,8 @@ work_pending:
>  	mov	x0, sp				// 'regs'
>  	tst	x2, #PSR_MODE_MASK		// user mode regs?
>  	b.ne	no_work_pending			// returning to kernel
> -	enable_irq				// enable interrupts for do_notify_resume()
> -	bl	do_notify_resume
> -	b	ret_to_user
> +	bl	prepare_exit_to_usermode
> +	b	no_user_work_pending
>  work_resched:
>  	bl	schedule
>  
> @@ -644,6 +643,7 @@ ret_to_user:
>  	ldr	x1, [tsk, #TI_FLAGS]
>  	and	x2, x1, #_TIF_WORK_MASK
>  	cbnz	x2, work_pending
> +no_user_work_pending:
>  	enable_step_tsk x1, x2
>  no_work_pending:
>  	kernel_exit 0

It seems unfortunate to leave behind portions of the entry.S
_TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
and the majority of work_pending and ret_to_user).

I think it would be nicer if we could handle all of that in one place
(or at least all in C).

> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index e18c48cb6db1..fde59c1139a9 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> -asmlinkage void do_notify_resume(struct pt_regs *regs,
> -				 unsigned int thread_flags)
> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
> +					 unsigned int thread_flags)
>  {
> -	if (thread_flags & _TIF_SIGPENDING)
> -		do_signal(regs);
> +	do {
> +		local_irq_enable();
>  
> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
> -		clear_thread_flag(TIF_NOTIFY_RESUME);
> -		tracehook_notify_resume(regs);
> -	}
> +		if (thread_flags & _TIF_NEED_RESCHED)
> +			schedule();

Previously, had we called schedule(), we'd reload the thread info flags
and start that state machine again, whereas now we'll handle all the
cached flags before reloading.

Are we sure nothing is relying on the prior behaviour?

> +
> +		if (thread_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(regs);
> +		}
> +
> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
> +			fpsimd_restore_current_state();
> +
> +		local_irq_disable();
>  
> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
> -		fpsimd_restore_current_state();
> +		thread_flags = READ_ONCE(current_thread_info()->flags) &
> +			_TIF_WORK_MASK;
>  
> +	} while (thread_flags);
>  }

Other than that, this looks good to me.

Thanks,
Mark.

[1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 20:33     ` Mark Rutland
@ 2016-01-04 21:01       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 21:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-arm-kernel, linux-kernel

On 01/04/2016 03:33 PM, Mark Rutland wrote:
> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.

Thanks!

> It seems unfortunate to leave behind portions of the entry.S
> _TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
> and the majority of work_pending and ret_to_user).
>
> I think it would be nicer if we could handle all of that in one place
> (or at least all in C).

Yes, in principle I agree with this, and I think your deasm tree looks
like an excellent idea.

For this patch series I wanted to focus more on what was necessary
for the various platforms to implement task isolation, and less on
additional cleanups of the platforms in question.  I think my changes
don't make the TIF state machine any less clear, nor do they make
it harder for an eventual further migration to C code along the lines
of what you've done, so it seems plausible to me to commit them
upstream independently of your work.

>> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
>> index e18c48cb6db1..fde59c1139a9 100644
>> --- a/arch/arm64/kernel/signal.c
>> +++ b/arch/arm64/kernel/signal.c
>> @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs)
>>   	restore_saved_sigmask();
>>   }
>>   
>> -asmlinkage void do_notify_resume(struct pt_regs *regs,
>> -				 unsigned int thread_flags)
>> +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs,
>> +					 unsigned int thread_flags)
>>   {
>> -	if (thread_flags & _TIF_SIGPENDING)
>> -		do_signal(regs);
>> +	do {
>> +		local_irq_enable();
>>   
>> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
>> -		clear_thread_flag(TIF_NOTIFY_RESUME);
>> -		tracehook_notify_resume(regs);
>> -	}
>> +		if (thread_flags & _TIF_NEED_RESCHED)
>> +			schedule();
> Previously, had we called schedule(), we'd reload the thread info flags
> and start that state machine again, whereas now we'll handle all the
> cached flags before reloading.
>
> Are we sure nothing is relying on the prior behaviour?

Good eye, and I probably should have called that out in the commit
message.  My best guess is that nothing depends on the old semantics.
Other platforms (certainly x86 and tile, anyway) already finish
handling the cached flags on return from schedule() before reloading,
so regardless, it's probably appropriate for arm64 to follow that same
convention.  Note that any work that becomes pending while we are in
schedule() is still handled before the return to userspace; it is just
picked up on the next loop iteration, when the flags are re-read.

>> +
>> +		if (thread_flags & _TIF_SIGPENDING)
>> +			do_signal(regs);
>> +
>> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
>> +			clear_thread_flag(TIF_NOTIFY_RESUME);
>> +			tracehook_notify_resume(regs);
>> +		}
>> +
>> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> +			fpsimd_restore_current_state();
>> +
>> +		local_irq_disable();
>>   
>> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
>> -		fpsimd_restore_current_state();
>> +		thread_flags = READ_ONCE(current_thread_info()->flags) &
>> +			_TIF_WORK_MASK;
>>   
>> +	} while (thread_flags);
>>   }
> Other than that, this looks good to me.
>
> Thanks,
> Mark.
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Thanks again for the review - shall I add your Reviewed-by (or Acked-by?)
to this patch?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH v9bis 07/13] arch/x86: enable task isolation functionality
  2016-01-04 19:34 ` [PATCH v9 07/13] arch/x86: enable task isolation functionality Chris Metcalf
@ 2016-01-04 21:02   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 21:02 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	H. Peter Anvin, x86, linux-kernel
  Cc: Chris Metcalf

In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.

We add strict reporting for the kernel exception types that do
not result in signals, namely non-signalling page faults and
non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
Oops! In v9 I sent a version of this patch that didn't have the
semantic merge to 4.4 from Andy's commit 39b48e575e92 ("x86/entry:
Split and inline prepare_exit_to_usermode()").  This "v9bis" version
adds the necessary extra check to get into exit_to_usermode_loop()
in the first place when running in task-isolation mode.

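One other note, on the first hunk below: setting regs->orig_ax to -1
is the usual x86 idiom for suppressing a syscall (ptrace tracers skip
syscalls the same way).  The syscall number becomes invalid, so the
dispatch code never runs the handler; schematically (paraphrased, not
the literal entry code):

	/* in the syscall dispatch path, roughly: */
	if (likely((unsigned long)regs->orig_ax < NR_syscalls))
		regs->ax = sys_call_table[regs->orig_ax](regs->di, regs->si,
				regs->dx, regs->r10, regs->r8, regs->r9);
	/* orig_ax == -1 is out of range, so no handler runs */
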
 arch/x86/entry/common.c | 18 ++++++++++++++++--
 arch/x86/kernel/traps.c |  2 ++
 arch/x86/mm/fault.c     |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a89fdbc1f0be..477d8cafaaf2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include <linux/context_tracking.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/isolation.h>
 
 #include <asm/desc.h>
 #include <asm/traps.h>
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		enter_from_user_mode();
+		if (task_isolation_check_syscall(regs->orig_ax)) {
+			regs->orig_ax = -1;
+			return 0;
+		}
 		work &= ~_TIF_NOHZ;
 	}
 #endif
@@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		task_isolation_enter();
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
 		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
+		    task_isolation_ready())
 			break;
 
 	}
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ)
+#else
+# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
+#endif
+
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
@@ -278,7 +292,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	cached_flags =
 		READ_ONCE(pt_regs_to_thread_info(regs)->flags);
 
-	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
+	if (unlikely(cached_flags & EXIT_TO_USERMODE_FLAGS))
 		exit_to_usermode_loop(regs, cached_flags);
 
 	user_enter();
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..82bf53ec1e98 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -36,6 +36,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/isolation.h>
 
 #ifdef CONFIG_EISA
 #include <linux/ioport.h>
@@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	case 2:	/* Bound directory has invalid entry. */
 		if (mpx_handle_bd_fault())
 			goto exit_trap;
+		task_isolation_check_exception("bounds check");
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
 		info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..7b23487a3bd7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -14,6 +14,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/isolation.h>		/* task_isolation_check_exception */
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		local_irq_enable();
 		error_code |= PF_USER;
 		flags |= FAULT_FLAG_USER;
+		task_isolation_check_exception("page fault at %#lx", address);
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 20:33     ` Mark Rutland
@ 2016-01-04 22:31       ` Andy Lutomirski
  0 siblings, 0 replies; 92+ messages in thread
From: Andy Lutomirski @ 2016-01-04 22:31 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel

On Mon, Jan 4, 2016 at 12:33 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi,
>
> On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
>> This change is a prerequisite change for TASK_ISOLATION but also
>> stands on its own for readability and maintainability.
>
> I have also been looking into converting the userspace return path from
> assembly to C [1], for the latter two reasons. Based on that, I have a
> couple of comments.
>

>
> [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

Neat!

In case you want to compare notes, I have a branch with the entire
syscall path on x86 in C except for cleanly separated asm fast path
optimizations:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/entry_compat

Even in Linus' tree, the x86 32-bit syscalls are in C.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 19:34 ` [PATCH v9 06/13] task_isolation: add debug boot flag Chris Metcalf
@ 2016-01-04 22:52   ` Steven Rostedt
  2016-01-04 23:42     ` Chris Metcalf
  0 siblings, 1 reply; 92+ messages in thread
From: Steven Rostedt @ 2016-01-04 22:52 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Mon, 4 Jan 2016 14:34:44 -0500
Chris Metcalf <cmetcalf@ezchip.com> wrote:


> +#ifdef CONFIG_TASK_ISOLATION
> +void task_isolation_debug(int cpu)
> +{
> +	struct task_struct *p;
> +
> +	if (!task_isolation_possible(cpu))
> +		return;
> +
> +	rcu_read_lock();

What's the rcu_read_lock() for? I don't see what is being protected by
rcu here?

-- Steve

> +	p = cpu_curr(cpu);
> +	get_task_struct(p);
> +	rcu_read_unlock();
> +	task_isolation_debug_task(cpu, p);
> +	put_task_struct(p);
> +}
> +#endif
> +

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 22:52   ` Steven Rostedt
@ 2016-01-04 23:42     ` Chris Metcalf
  2016-01-05 13:42       ` Steven Rostedt
  0 siblings, 1 reply; 92+ messages in thread
From: Chris Metcalf @ 2016-01-04 23:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On 1/4/2016 5:52 PM, Steven Rostedt wrote:
> On Mon, 4 Jan 2016 14:34:44 -0500
> Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>
>> >+#ifdef CONFIG_TASK_ISOLATION
>> >+void task_isolation_debug(int cpu)
>> >+{
>> >+	struct task_struct *p;
>> >+
>> >+	if (!task_isolation_possible(cpu))
>> >+		return;
>> >+
>> >+	rcu_read_lock();
> What's the rcu_read_lock() for? I don't see what is being protected by
> rcu here?

I'm not completely clear either, but this is the same idiom used throughout
kernel/sched/core.c when mapping from a pid or a cpu to a task_struct, since
obviously you could end up racing with the task_struct being removed after the
task dies.  My best understanding is that the rcu_read_lock() holds up the final
free of the structure so that we have time here to get another reference to it.

See for example sched_setaffinity() for a similar use of the idiom.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 06/13] task_isolation: add debug boot flag
  2016-01-04 23:42     ` Chris Metcalf
@ 2016-01-05 13:42       ` Steven Rostedt
  0 siblings, 0 replies; 92+ messages in thread
From: Steven Rostedt @ 2016-01-05 13:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-kernel

On Mon, 4 Jan 2016 18:42:00 -0500
Chris Metcalf <cmetcalf@ezchip.com> wrote:

> On 1/4/2016 5:52 PM, Steven Rostedt wrote:
> > On Mon, 4 Jan 2016 14:34:44 -0500
> > Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >  
> >> >+#ifdef CONFIG_TASK_ISOLATION
> >> >+void task_isolation_debug(int cpu)
> >> >+{
> >> >+	struct task_struct *p;
> >> >+
> >> >+	if (!task_isolation_possible(cpu))
> >> >+		return;
> >> >+
> >> >+	rcu_read_lock();  
> > What's the rcu_read_lock() for? I don't see what is being protected by
> > rcu here?  
> 
> I'm not completely clear either, but this is the same idiom as is used throughout
> kernel/sched/core.c when mapping from a pid or a cpu to a task_struct, since
> obviously you could end up racing with the task_struct being removed after the
> task dies.  My best understanding is that the rcu_read_lock() holds up the final
> free of the structure so that we have time here to get another reference to it.
> 
> See for example sched_setaffinity() for a similar use of the idiom.
> 

Ah you're right. I'm still trying to get back up to speed from the
holidays. Yeah, we need to grab the lock to prevent the task from going
away from the time we get cpu_curr() to the time we up its ref count.
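
Schematically, the idiom works because the final free of a task_struct
is RCU-deferred (via delayed_put_task_struct()), so:

	rcu_read_lock();
	p = cpu_curr(cpu);	/* task may be exiting concurrently */
	get_task_struct(p);	/* safe: RCU holds up the final free, so
				 * we can still take a new reference */
	rcu_read_unlock();
	task_isolation_debug_task(cpu, p);
	put_task_struct(p);	/* drop our reference when done */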

-- Steve

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 21:01       ` Chris Metcalf
@ 2016-01-05 17:21         ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:21 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Rik van Riel, Catalin Marinas, Peter Zijlstra,
	Frederic Weisbecker, Will Deacon, linux-kernel, Steven Rostedt,
	Andy Lutomirski, Thomas Gleixner, linux-arm-kernel, Viresh Kumar,
	Tejun Heo, Andrew Morton, Paul E. McKenney, Christoph Lameter,
	Ingo Molnar, Gilad Ben Yossef

On Mon, Jan 04, 2016 at 04:01:05PM -0500, Chris Metcalf wrote:
> On 01/04/2016 03:33 PM, Mark Rutland wrote:
> >Hi,
> >
> >On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> >>This change is a prerequisite change for TASK_ISOLATION but also
> >>stands on its own for readability and maintainability.
> >I have also been looking into converting the userspace return path from
> >assembly to C [1], for the latter two reasons. Based on that, I have a
> >couple of comments.
> 
> Thanks!
> 
> >It seems unfortunate to leave behind portions of the entry.S
> >_TIF_WORK_MASK state machine (i.e. a small portion of ret_fast_syscall,
> >and the majority of work_pending and ret_to_user).
> >
> >I think it would be nicer if we could handle all of that in one place
> >(or at least all in C).
> 
> Yes, in principle I agree with this, and I think your deasm tree looks
> like an excellent idea.
> 
> For this patch series I wanted to focus more on what was necessary
> for the various platforms to implement task isolation, and less on
> additional cleanups of the platforms in question.  I think my changes
> don't make the TIF state machine any less clear, nor do they make
> it harder for an eventual further migration to C code along the lines
> of what you've done, so it seems plausible to me to commit them
> upstream independently of your work.

I appreciate that you don't want to rewrite all the code.

However, I think it's easier to factor out a small amount of additional
code now and evolve that as a whole than it will be to evolve part of it
and try to put it back together later.

I have a patch which I will reply with momentarily.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 1/2] arm64: entry: remove pointless SPSR mode check
  2016-01-05 17:21         ` Mark Rutland
@ 2016-01-05 17:33           ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:33 UTC (permalink / raw)
  To: cmetcalf
  Cc: mark.rutland, will.deacon, catalin.marinas, luto, linux-kernel,
	linux-arm-kernel

In work_pending, we may skip work if the stacked SPSR value represents
anything other than an EL0 context. We then immediately invoke the
kernel_exit 0 macro as part of ret_to_user, assuming a return to EL0.
This is somewhat confusing.

We use work_pending as part of the ret_to_user/ret_fast_syscall state
machine. We only use ret_fast_syscall in the return from an SVC issued
from EL0. We use ret_to_user for return from EL0 exception handlers and
also for return from ret_from_fork in the case the task was not a kernel
thread (i.e. it is a user task).

Thus in all cases the stacked SPSR value must represent an EL0 context,
and the check is redundant. This patch removes it, along with the now
unused no_work_pending label.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Will Deacon <will.deacon@arm.com>
---
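To summarize the callers (per the reasoning above):

	ret_fast_syscall -> work_pending	only from an EL0 SVC
	ret_to_user	 -> work_pending	EL0 exception returns, plus
						ret_from_fork for user tasks

so every path into work_pending has EL0 user regs stacked, and the SPSR
mode test can never take the no_work_pending branch.
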
 arch/arm64/kernel/entry.S | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ed3d75..6b30ab1 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -626,10 +626,7 @@ ret_fast_syscall_trace:
 work_pending:
 	tbnz	x1, #TIF_NEED_RESCHED, work_resched
 	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
-	ldr	x2, [sp, #S_PSTATE]
 	mov	x0, sp				// 'regs'
-	tst	x2, #PSR_MODE_MASK		// user mode regs?
-	b.ne	no_work_pending			// returning to kernel
 	enable_irq				// enable interrupts for do_notify_resume()
 	bl	do_notify_resume
 	b	ret_to_user
@@ -645,7 +642,6 @@ ret_to_user:
 	and	x2, x1, #_TIF_WORK_MASK
 	cbnz	x2, work_pending
 	enable_step_tsk x1, x2
-no_work_pending:
 	kernel_exit 0
 ENDPROC(ret_to_user)
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:21         ` Mark Rutland
@ 2016-01-05 17:33           ` Mark Rutland
  0 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 17:33 UTC (permalink / raw)
  To: cmetcalf
  Cc: mark.rutland, will.deacon, catalin.marinas, luto, linux-kernel,
	linux-arm-kernel

Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
state machine that can be difficult to reason about due to duplicated
code and a large number of branch targets.

This patch factors the common logic out into the existing
do_notify_resume function, converting it to C in the process and
making it more legible.

This patch tries to mirror the existing behaviour as closely as possible
while using the usual C control flow primitives. There should be no
functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/kernel/entry.S  | 24 +++---------------------
 arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
 2 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6b30ab1..41f5dfc 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -612,35 +612,17 @@ ret_fast_syscall:
 	ldr	x1, [tsk, #TI_FLAGS]		// re-check for syscall tracing
 	and	x2, x1, #_TIF_SYSCALL_WORK
 	cbnz	x2, ret_fast_syscall_trace
-	and	x2, x1, #_TIF_WORK_MASK
-	cbnz	x2, work_pending
-	enable_step_tsk x1, x2
-	kernel_exit 0
+	b	ret_to_user
 ret_fast_syscall_trace:
 	enable_irq				// enable interrupts
 	b	__sys_trace_return_skipped	// we already saved x0
 
 /*
- * Ok, we need to do extra processing, enter the slow path.
- */
-work_pending:
-	tbnz	x1, #TIF_NEED_RESCHED, work_resched
-	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
-	mov	x0, sp				// 'regs'
-	enable_irq				// enable interrupts for do_notify_resume()
-	bl	do_notify_resume
-	b	ret_to_user
-work_resched:
-	bl	schedule
-
-/*
  * "slow" syscall return path.
  */
 ret_to_user:
-	disable_irq				// disable interrupts
-	ldr	x1, [tsk, #TI_FLAGS]
-	and	x2, x1, #_TIF_WORK_MASK
-	cbnz	x2, work_pending
+	bl	do_notify_resume
+	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
 	enable_step_tsk x1, x2
 	kernel_exit 0
 ENDPROC(ret_to_user)
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index e18c48c..3a6c60b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -399,18 +399,34 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-asmlinkage void do_notify_resume(struct pt_regs *regs,
-				 unsigned int thread_flags)
+asmlinkage void do_notify_resume(void)
 {
-	if (thread_flags & _TIF_SIGPENDING)
-		do_signal(regs);
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long thread_flags;
 
-	if (thread_flags & _TIF_NOTIFY_RESUME) {
-		clear_thread_flag(TIF_NOTIFY_RESUME);
-		tracehook_notify_resume(regs);
-	}
+	for (;;) {
+		local_irq_disable();
+
+		thread_flags = READ_ONCE(current_thread_info()->flags);
+		if (!(thread_flags & _TIF_WORK_MASK))
+			break;
+
+		if (thread_flags & _TIF_NEED_RESCHED) {
+			schedule();
+			continue;
+		}
 
-	if (thread_flags & _TIF_FOREIGN_FPSTATE)
-		fpsimd_restore_current_state();
+		local_irq_enable();
 
+		if (thread_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
+		if (thread_flags & _TIF_NOTIFY_RESUME) {
+			clear_thread_flag(TIF_NOTIFY_RESUME);
+			tracehook_notify_resume(regs);
+		}
+
+		if (thread_flags & _TIF_FOREIGN_FPSTATE)
+			fpsimd_restore_current_state();
+	}
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86
  2016-01-04 22:31       ` Andy Lutomirski
@ 2016-01-05 18:01         ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-05 18:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel

On Mon, Jan 04, 2016 at 02:31:42PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 4, 2016 at 12:33 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > Hi,
> >
> > On Mon, Jan 04, 2016 at 02:34:46PM -0500, Chris Metcalf wrote:
> >> This change is a prerequisite change for TASK_ISOLATION but also
> >> stands on its own for readability and maintainability.
> >
> > I have also been looking into converting the userspace return path from
> > assembly to C [1], for the latter two reasons. Based on that, I have a
> > couple of comments.
> >
> 
> >
> > [1] https://git.kernel.org/cgit/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm
> 
> Neat!
> 
> In case you want to compare notes, I have a branch with the entire
> syscall path on x86 in C except for cleanly separated asm fast path
> optimizations:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/entry_compat

It was in fact your x86 effort that inspired me to look at this!

Thanks for the pointer, I'm almost certainly going to steal an idea or
two.

Currently it looks like arm64's conversion will be less painful than
that for x86 as the entry assembly is smaller and relatively uniform.
Everything but the register save/restore looks doable in C.

That said, I have yet to stress/validate everything with tracing, irq
debugging, and so on, so my confidence may be misplaced.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-05 18:53             ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-05 18:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: will.deacon, catalin.marinas, luto, linux-kernel, linux-arm-kernel

On 01/05/2016 12:33 PM, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
>
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
>
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland<mark.rutland@arm.com>
> Cc: Catalin Marinas<catalin.marinas@arm.com>
> Cc: Chris Metcalf<cmetcalf@ezchip.com>
> Cc: Will Deacon<will.deacon@arm.com>
> ---
>   arch/arm64/kernel/entry.S  | 24 +++---------------------
>   arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>   2 files changed, 29 insertions(+), 31 deletions(-)

This looks good, and also makes the task isolation change drop in
very cleanly (relatively speaking).  Since do_notify_resume() is
called unconditionally now, we don't have to worry about fussing
with the bit numbering for the TIF_xxx flags in asm/thread_info.h, so
that whole part of the patch can be dropped, and the actual
change to do_notify_resume() becomes:

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 3a6c60beadca..00d0ec3a8e60 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -25,6 +25,7 @@
  #include <linux/uaccess.h>
  #include <linux/tracehook.h>
  #include <linux/ratelimit.h>
+#include <linux/isolation.h>
  
  #include <asm/debug-monitors.h>
  #include <asm/elf.h>
@@ -408,7 +409,8 @@ asmlinkage void do_notify_resume(void)
                 local_irq_disable();
  
                 thread_flags = READ_ONCE(current_thread_info()->flags);
-               if (!(thread_flags & _TIF_WORK_MASK))
+               if (!(thread_flags & _TIF_WORK_MASK) &&
+                   task_isolation_ready())
                         break;
  
                 if (thread_flags & _TIF_NEED_RESCHED) {
@@ -428,5 +430,7 @@ asmlinkage void do_notify_resume(void)
  
                 if (thread_flags & _TIF_FOREIGN_FPSTATE)
                         fpsimd_restore_current_state();
+
+               task_isolation_enter();
         }
  }


For the moment I just added your two commits into my task-isolation
tree and pushed it up, but if your changes make it into 4.5 and the
task-isolation series doesn't, I will remove them and rebase on 4.5-rc1
once that's released.  I've similarly staged the arch/tile enablement
changes to go into 4.5 so I can drop them from the task-isolation tree
as well at that point.
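
For reference, the two hooks used in the diff above behave roughly as
follows.  This is a paraphrased sketch built from the helpers the series
adds (vmstat_idle(), lru_add_drain_needed(), quiet_vmstat()), not the
exact code; in particular the lru_add_drain_needed() signature here is a
guess:

/* sketch only - see kernel/isolation.c in the series for the real code */
static inline bool task_isolation_ready(void)
{
	/* ready once the core is quiesced: vmstat deltas flushed, LRU
	 * pagevecs drained, and the dynticks tick already stopped */
	return vmstat_idle() &&
	       !lru_add_drain_needed(smp_processor_id()) &&
	       tick_nohz_tick_stopped();
}

static inline void task_isolation_enter(void)
{
	/* do whatever work is needed to make the checks above pass */
	lru_add_drain();
	quiet_vmstat();
}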

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


^ permalink raw reply related	[flat|nested] 92+ messages in thread


* Re: [PATCH 1/2] arm64: entry: remove pointless SPSR mode check
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 12:15             ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 12:15 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:34PM +0000, Mark Rutland wrote:
> In work_pending we may skip work if the stacked SPSR value represents
> anything other than an EL0 context. We then immediately invoke the
> kernel_exit 0 macro as part of ret_to_user, assuming a return to EL0.
> This is somewhat confusing.
> 
> We use work_pending as part of the ret_to_user/ret_fast_syscall state
> machine. We only use ret_fast_syscall in the return from an SVC issued
> from EL0. We use ret_to_user for return from EL0 exception handlers and
> also for return from ret_from_fork in the case the task was not a kernel
> thread (i.e. it is a user task).
> 
> Thus in all cases the stacked SPSR value must represent an EL0 context,
> and the check is redundant. This patch removes it, along with the now
> unused no_work_pending label.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 12:30             ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 12:30 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
> 
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
> 
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>

This is definitely cleaner. The only downside is slightly more expensive
ret_fast_syscall. I guess it's not noticeable (though we could do some
quick benchmark like getpid in a loop). Anyway, I'm fine with the patch:

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
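
For reference, the quick benchmark mentioned above can be as simple as
timing the raw syscall in a loop.  A minimal userspace sketch (not part
of the patch; syscall(SYS_getpid) is used to bypass glibc's cached
getpid()):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	const long iters = 10000000;
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* bypass the glibc pid cache */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per getpid() round trip\n", ns / iters);
	return 0;
}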

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-06 12:30             ` Catalin Marinas
@ 2016-01-06 12:47               ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-06 12:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: cmetcalf, will.deacon, linux-kernel, luto, linux-arm-kernel

On Wed, Jan 06, 2016 at 12:30:11PM +0000, Catalin Marinas wrote:
> On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> > Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> > state machine that can be difficult to reason about due to duplicated
> > code and a large number of branch targets.
> > 
> > This patch factors the common logic out into the existing
> > do_notify_resume function, converting the code to C in the process,
> > making the code more legible.
> > 
> > This patch tries to mirror the existing behaviour as closely as possible
> > while using the usual C control flow primitives. There should be no
> > functional change as a result of this patch.
> > 
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Chris Metcalf <cmetcalf@ezchip.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> 
> This is definitely cleaner. The only downside is slightly more expensive
> ret_fast_syscall. I guess it's not noticeable (though we could do some
> quick benchmark like getpid in a loop). Anyway, I'm fine with the patch:
> 
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>

Cheers!

While any additional overhead hasn't been noticeable, I'll try to get
some numbers out as part of the larger deasm testing/benchmarking.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-05 17:33           ` Mark Rutland
@ 2016-01-06 13:43             ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-06 13:43 UTC (permalink / raw)
  To: cmetcalf, catalin.marinas
  Cc: will.deacon, luto, linux-kernel, linux-arm-kernel

On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> state machine that can be difficult to reason about due to duplicated
> code and a large number of branch targets.
> 
> This patch factors the common logic out into the existing
> do_notify_resume function, converting the code to C in the process,
> making the code more legible.
> 
> This patch tries to mirror the existing behaviour as closely as possible
> while using the usual C control flow primitives. There should be no
> functional change as a result of this patch.

I realised there is a problem with this for kernels built with
TRACE_IRQFLAGS, as local_irq_{enable,disable}() will verify that the IRQ
state is as expected.

In ret_fast_syscall we disable irqs behind the back of the tracer, so
when we get into do_notify_resume we'll get a splat.

In the non-syscall cases we do not disable interrupts first, so we can't
balance things in do_notify_resume.

We can either add a trace_hardirqs_off call to ret_fast_syscall, or we
can use raw_local_irq_{disable,enable}. The latter would match the
current behaviour (and is a nicer diff). Once the syscall path is moved
to C it would be possible to use the non-raw variants all-over.

Catalin, are you happy with using the raw accessors in do_notify_resume,
or would you prefer using trace_hardirqs_off?
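
Concretely, the raw-accessor option would make the loop look like the
sketch below (work handling elided; it stays exactly as in the patch
quoted at the end of this mail):

asmlinkage void do_notify_resume(void)
{
	unsigned long thread_flags;

	for (;;) {
		/* raw variants skip the TRACE_IRQFLAGS bookkeeping, so
		 * there is no splat when the entry assembly disabled
		 * irqs without informing the tracer */
		raw_local_irq_disable();

		thread_flags = READ_ONCE(current_thread_info()->flags);
		if (!(thread_flags & _TIF_WORK_MASK))
			break;

		if (thread_flags & _TIF_NEED_RESCHED) {
			schedule();
			continue;
		}

		raw_local_irq_enable();

		/* handle _TIF_SIGPENDING, _TIF_NOTIFY_RESUME and
		 * _TIF_FOREIGN_FPSTATE as before */
	}
}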

Thanks,
Mark.

> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Chris Metcalf <cmetcalf@ezchip.com>
> Cc: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/kernel/entry.S  | 24 +++---------------------
>  arch/arm64/kernel/signal.c | 36 ++++++++++++++++++++++++++----------
>  2 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 6b30ab1..41f5dfc 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -612,35 +612,17 @@ ret_fast_syscall:
>  	ldr	x1, [tsk, #TI_FLAGS]		// re-check for syscall tracing
>  	and	x2, x1, #_TIF_SYSCALL_WORK
>  	cbnz	x2, ret_fast_syscall_trace
> -	and	x2, x1, #_TIF_WORK_MASK
> -	cbnz	x2, work_pending
> -	enable_step_tsk x1, x2
> -	kernel_exit 0
> +	b	ret_to_user
>  ret_fast_syscall_trace:
>  	enable_irq				// enable interrupts
>  	b	__sys_trace_return_skipped	// we already saved x0
>  
>  /*
> - * Ok, we need to do extra processing, enter the slow path.
> - */
> -work_pending:
> -	tbnz	x1, #TIF_NEED_RESCHED, work_resched
> -	/* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
> -	mov	x0, sp				// 'regs'
> -	enable_irq				// enable interrupts for do_notify_resume()
> -	bl	do_notify_resume
> -	b	ret_to_user
> -work_resched:
> -	bl	schedule
> -
> -/*
>   * "slow" syscall return path.
>   */
>  ret_to_user:
> -	disable_irq				// disable interrupts
> -	ldr	x1, [tsk, #TI_FLAGS]
> -	and	x2, x1, #_TIF_WORK_MASK
> -	cbnz	x2, work_pending
> +	bl	do_notify_resume
> +	ldr	x1, [tsk, #TI_FLAGS]		// re-check for single-step
>  	enable_step_tsk x1, x2
>  	kernel_exit 0
>  ENDPROC(ret_to_user)
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index e18c48c..3a6c60b 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -399,18 +399,34 @@ static void do_signal(struct pt_regs *regs)
>  	restore_saved_sigmask();
>  }
>  
> -asmlinkage void do_notify_resume(struct pt_regs *regs,
> -				 unsigned int thread_flags)
> +asmlinkage void do_notify_resume(void)
>  {
> -	if (thread_flags & _TIF_SIGPENDING)
> -		do_signal(regs);
> +	struct pt_regs *regs = task_pt_regs(current);
> +	unsigned long thread_flags;
>  
> -	if (thread_flags & _TIF_NOTIFY_RESUME) {
> -		clear_thread_flag(TIF_NOTIFY_RESUME);
> -		tracehook_notify_resume(regs);
> -	}
> +	for (;;) {
> +		local_irq_disable();

This should be raw_local_irq_disable()...

> +
> +		thread_flags = READ_ONCE(current_thread_info()->flags);
> +		if (!(thread_flags & _TIF_WORK_MASK))
> +			break;
> +
> +		if (thread_flags & _TIF_NEED_RESCHED) {
> +			schedule();
> +			continue;
> +		}
>  
> -	if (thread_flags & _TIF_FOREIGN_FPSTATE)
> -		fpsimd_restore_current_state();
> +		local_irq_enable();

... likewise, raw_local_irq_enable() here.

>  
> +		if (thread_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> +		if (thread_flags & _TIF_NOTIFY_RESUME) {
> +			clear_thread_flag(TIF_NOTIFY_RESUME);
> +			tracehook_notify_resume(regs);
> +		}
> +
> +		if (thread_flags & _TIF_FOREIGN_FPSTATE)
> +			fpsimd_restore_current_state();
> +	}
>  }
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH 2/2] arm64: factor work_pending state machine to C
  2016-01-06 13:43             ` Mark Rutland
@ 2016-01-06 14:17               ` Catalin Marinas
  -1 siblings, 0 replies; 92+ messages in thread
From: Catalin Marinas @ 2016-01-06 14:17 UTC (permalink / raw)
  To: Mark Rutland; +Cc: cmetcalf, will.deacon, linux-kernel, linux-arm-kernel, luto

On Wed, Jan 06, 2016 at 01:43:14PM +0000, Mark Rutland wrote:
> On Tue, Jan 05, 2016 at 05:33:35PM +0000, Mark Rutland wrote:
> > Currently ret_fast_syscall, work_pending, and ret_to_user form an ad-hoc
> > state machine that can be difficult to reason about due to duplicated
> > code and a large number of branch targets.
> > 
> > This patch factors the common logic out into the existing
> > do_notify_resume function, converting the code to C in the process,
> > making the code more legible.
> > 
> > This patch tries to mirror the existing behaviour as closely as possible
> > while using the usual C control flow primitives. There should be no
> > functional change as a result of this patch.
> 
> I realised there is a problem with this for kernel built with
> TRACE_IRQFLAGS, as local_irq_{enable,disable}() will verify that the IRQ
> state is as expected.
> 
> In ret_fast_syscall we disable irqs behind the back of the tracer, so
> when we get into do_notify_resume we'll get a splat.
> 
> In the non-syscall cases we do not disable interrupts first, so we can't
> balance things in do_notify_resume.
> 
> We can either add a trace_hardirqs_off call to ret_fast_syscall, or we
> can use raw_local_irq_{disable,enable}. The latter would match the
> current behaviour (and is a nicer diff). Once the syscall path is moved
> to C it would be possible to use the non-raw variants all-over.
> 
> Catalin, are you happy with using the raw accessors in do_notify_resume,
> or would you prefer using trace_hardirqs_off?

I would prefer the explicit trace_hardirqs_off annotation, even though
it is a few more lines.

-- 
Catalin

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-11 21:15   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-11 21:15 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Daniel Lezcano, linux-doc, linux-api, linux-kernel

Ping!  There has been no substantive feedback to this version of
the patch in the week since I posted it, which optimistically suggests
to me that people may be satisfied with it.  If that's true, Frederic,
I assume this would be pulled into your tree?

I have slightly updated the v9 patch series since this posting:

- Incorporated a fix to initialize cpu_isolation_mask early if no
   cpu_isolation= boot argument was given, to avoid crashing on
   CPUMASK_OFFSTACK platforms.

- Incorporated Mark Rutland's changes to convert arm64
   assembly to C code instead of using my own version.

The updated patch series is available in the branch at

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

I will post a v10 with those couple of small changes if I don't hear
any other feedback, or of course feel free to pull from the git repo.
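
For anyone experimenting with the branch: after pinning a thread to one
of the task_isolation= cores, isolation is requested via prctl().  A
minimal sketch, assuming the constant names and values from the series'
uapi prctl.h additions (check the branch for the real definitions):

#include <stdio.h>
#include <sys/prctl.h>

/* values assumed for illustration; see the series' prctl.h changes */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#endif

int main(void)
{
	/* assumes this thread is already affinitized to an isolated core */
	if (prctl(PR_SET_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT,
		  0, 0, 0) < 0) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}

	for (;;)
		;	/* pure userspace loop; any kernel entry now
			 * raises the STRICT signal */
}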

On 01/04/2016 02:34 PM, Chris Metcalf wrote:
> It has been a couple of months since the v8 version of this patch,
> since various other priorities came up at work.  Since it's been
> a while I will try to summarize where I think we got to on the
> various issues that were raised with v8.
>
> 1. Andy Lutomirski raised the issue of whether it really made sense to
>     only attempt to set up the conditions for task isolation, ask the kernel
>     nicely for it, and then wait until it happened.  He wondered if a
>     SCHED_ISOLATED class might be a helpful abstraction.  Steven Rostedt
>     also suggested having an interface that would force everything else
>     off a core to enable SCHED_ISOLATED to succeed.  Frederick added
>     some concerns about enforcing the test that the process was in a
>     good state to enter task isolation.
>
>     I tried to address the different design philosophies for what I called
>     the original "polite" mode and the reviewers' suggestions for an
>     "aggressive" mode in this email:
>
>     https://lkml.org/lkml/2015/10/26/625
>
>     As I said there, on balance I think the "polite" option is still
>     better.  Obviously folks are welcome to disagree and I'm happy to
>     continue that conversation (or perhaps I convinced everyone).
>
> 2. Andy didn't like the idea of having a "STRICT" mode which
>     delivered a signal to a process for violating the contract that it
>     will promise to stay out of the kernel.  Gilad Ben Yossef argued that
>     it made sense to have a way for the kernel to enforce the requested
>     correctness guarantee of never being interrupted.  Andy pointed out
>     that we should then really deliver such a signal when the kernel
>     delivers an asynchronous interrupt to the core as well.  In particular
>     this is a concern for the application-error case of a process that
>     calls munmap() on one core while a thread on another core is running
>     STRICT, and thus gets an unexpected TLB flush.
>
>     This patch series addresses that concern by including support for
>     IRQs, IPIs, and similar asynchronous interrupts to also send the
>     STRICT signal to the process.  We don't try to send the signal if
>     we are in an NMI, and instead just force a console backtrace like
>     you would get in task_isolation_debug mode.
>
> 3. Frederic nack'ed my patch for a boot flag to disable the 1Hz
>     periodic scheduler tick.
>
>     I'm still hoping he's open to changing his mind about that, but in
>     this patch series I have removed that boot flag.
>
> Various other changes have been introduced since v8:
>
> https://lkml.kernel.org/r/1445373372-6567-1-git-send-email-cmetcalf@ezchip.com
>
> - Rebased to Linux 4.4-rc5.
>
> - Since nohz_full and isolcpus have been separated back out again in
>    4.4, I introduced a new task_isolation=MASK boot argument that sets
>    both of them.  The task isolation support now requires that this
>    boot flag have been used; it intentionally doesn't work if you've
>    just enabled nohz_full and isolcpus separately.  I could be
>    convinced that doing it the other way around makes sense, though.
>
> - I folded the two STRICT mode patches together since there didn't
>    seem to be much value in having the second patch that just enabled
>    having a settable signal.  I also refactored the various routines
>    that report on interrupts/exceptions/etc to make it easier to hook
>    in from the case where we are interrupted asynchronously.
>
> - For the debug support, I moved most of the functionality into
>    kernel/isolation.c and out of kernel/sched/core.c, leaving only a
>    small hook to handle mapping a remote cpu to a task struct safely.
>    In addition to implementing Andy's suggestion of signalling a task
>    when it is interrupted asynchronously, I also added a ratelimit
>    hook so we won't spam the console if (for example) a timer interrupt
>    runs amok - particularly since when this happens without ratelimit,
>    it can end up self-perpetuating the timer interrupt.
>
> - I added a task_isolation_debug_cpumask() helper function to check
>    all the cpus in a mask to see if they are being interrupted
>    inappropriately.
>
> - I made the check for irq_enter() robust to architectures that
>    have already entered user mode context_tracking before calling
>    irq_enter() by testing user_mode(get_irq_regs()) instead of
>    context_tracking_in_user(), and split out the code to a separate
>    inlined function so I could comment it better.
>
> - For arm64, I added a task_isolation_debug_cpumask() hook for
>    smp_cross_call(), which I had missed in the earlier versions.
>
> - I generalized the fix for tile to set up a clockevents hook for
>    set_state_oneshot_stopped() to also apply to the arm_arch_timer,
>    which I realized was showing the same problem.  For both cases,
>    this seems to be what Viresh had in mind with commit 8fff52fd509345
>    ("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").
>
> - For tile, I adopted the arm model of doing user_exit() calls in the
>    early assembly code (a new patch in this series).  I also added a
>    missing task_isolation_debug hook for tile's IPI and remote cache
>    flush code.
>
> Chris Metcalf (12):
>    vmstat: add vmstat_idle function
>    lru_add_drain_all: factor out lru_add_drain_needed
>    task_isolation: add initial support
>    task_isolation: support PR_TASK_ISOLATION_STRICT mode
>    task_isolation: add debug boot flag
>    arch/x86: enable task isolation functionality
>    arch/arm64: adopt prepare_exit_to_usermode() model from x86
>    arch/arm64: enable task isolation functionality
>    arch/tile: adopt prepare_exit_to_usermode() model from x86
>    arch/tile: move user_exit() to early kernel entry sequence
>    arch/tile: enable task isolation functionality
>    arm, tile: turn off timer tick for oneshot_stopped state
>
> Christoph Lameter (1):
>    vmstat: provide a function to quiet down the diff processing
>
>   Documentation/kernel-parameters.txt  |  16 +++
>   arch/arm64/include/asm/thread_info.h |  18 ++-
>   arch/arm64/kernel/entry.S            |   6 +-
>   arch/arm64/kernel/ptrace.c           |  12 +-
>   arch/arm64/kernel/signal.c           |  35 ++++--
>   arch/arm64/kernel/smp.c              |   2 +
>   arch/arm64/mm/fault.c                |   4 +
>   arch/tile/include/asm/processor.h    |   2 +-
>   arch/tile/include/asm/thread_info.h  |   8 +-
>   arch/tile/kernel/intvec_32.S         |  51 +++-----
>   arch/tile/kernel/intvec_64.S         |  54 +++------
>   arch/tile/kernel/process.c           |  83 +++++++------
>   arch/tile/kernel/ptrace.c            |  19 +--
>   arch/tile/kernel/single_step.c       |   8 +-
>   arch/tile/kernel/smp.c               |  26 ++--
>   arch/tile/kernel/time.c              |   1 +
>   arch/tile/kernel/traps.c             |  13 +-
>   arch/tile/kernel/unaligned.c         |  16 ++-
>   arch/tile/mm/fault.c                 |   6 +-
>   arch/tile/mm/homecache.c             |   2 +
>   arch/x86/entry/common.c              |  10 +-
>   arch/x86/kernel/traps.c              |   2 +
>   arch/x86/mm/fault.c                  |   2 +
>   drivers/clocksource/arm_arch_timer.c |   2 +
>   include/linux/isolation.h            |  80 +++++++++++++
>   include/linux/sched.h                |   3 +
>   include/linux/swap.h                 |   1 +
>   include/linux/vmstat.h               |   4 +
>   include/uapi/linux/prctl.h           |   8 ++
>   init/Kconfig                         |  20 ++++
>   kernel/Makefile                      |   1 +
>   kernel/irq_work.c                    |   5 +-
>   kernel/isolation.c                   | 225 +++++++++++++++++++++++++++++++++++
>   kernel/sched/core.c                  |  18 +++
>   kernel/signal.c                      |   5 +
>   kernel/smp.c                         |   6 +-
>   kernel/softirq.c                     |  33 +++++
>   kernel/sys.c                         |   9 ++
>   mm/swap.c                            |  13 +-
>   mm/vmstat.c                          |  24 ++++
>   40 files changed, 665 insertions(+), 188 deletions(-)
>   create mode 100644 include/linux/isolation.h
>   create mode 100644 kernel/isolation.c
>

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-11 21:15   ` Chris Metcalf
  (?)
@ 2016-01-12 10:07   ` Will Deacon
  2016-01-12 17:49       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Will Deacon @ 2016-01-12 10:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?
> 
> I have slightly updated the v9 patch series since this posting:
> 
> - Incorporated a fix to initialize cpu_isolation_mask early if no
>   cpu_isolation= boot argument was given, to avoid crashing on
>   CPUMASK_OFFSTACK platforms.
> 
> - Incorporated Mark Rutland's changes to convert arm64
>   assembly to C code instead of using my own version.

Please avoid queuing these patches -- the first is already in the arm64
queue for 4.5 and the second was found to introduce a substantial
performance regression on the syscall entry/exit path. I think Mark had
an updated version to address that, so it would be easier not to have
an old version sitting in some other queue!

Cheers,

Will

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-12 10:53     ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2016-01-12 10:53 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel


* Chris Metcalf <cmetcalf@ezchip.com> wrote:

> Ping!  There has been no substantive feedback to this version of
> the patch in the week since I posted it, which optimistically suggests
> to me that people may be satisfied with it.  If that's true, Frederic,
> I assume this would be pulled into your tree?

We are right before (and into) the merge window; don't expect substantial feedback
in that timeframe, as most kernel maintainers are very busy.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-12 17:49       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-12 17:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland

(Adding Mark to cc's)

On 01/12/2016 05:07 AM, Will Deacon wrote:
> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>> Ping!  There has been no substantive feedback to this version of
>> the patch in the week since I posted it, which optimistically suggests
>> to me that people may be satisfied with it.  If that's true, Frederic,
>> I assume this would be pulled into your tree?
>>
>> I have slightly updated the v9 patch series since this posting:
>>
>> [...]
>>
>> - Incorporated Mark Rutland's changes to convert arm64
>>    assembly to C code instead of using my own version.
> Please avoid queuing these patches -- the first is already in the arm64
> queue for 4.5 and the second was found to introduce a substantial
> performance regression on the syscall entry/exit path. I think Mark had
> an updated version to address that, so it would be easier not to have
> an old version sitting in some other queue!

I am not formally queueing them anywhere (like linux-next), though
now that you mention it, that's a pretty good idea - I'll talk to Steven
about that, assuming this merge window closes without the task
isolation stuff going in.

In the arch/tile code, we load the thread_info_flags and test them
against a bitmask before we call into C code, to avoid the various
overheads involved in the C path.  Perhaps that same strategy is all
that's needed for the arm64 code?  Hopefully you can get that
code merged up during the 4.5 window so I can use it as the new
baseline for the task isolation stuff.
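
Expressed as C rather than assembly, that fast path is roughly the
following sketch (the mask name here is arm64's; tile tests its own
per-arch work bits, and does so before any of the C-path save/restore
cost is paid):

        /* Sketch: what the tile assembly does, in C.  Only fall into
         * the heavyweight C exit work when a work bit is actually set.
         */
        static void exit_to_usermode_fast(struct pt_regs *regs)
        {
                unsigned long flags = READ_ONCE(current_thread_info()->flags);

                if (unlikely(flags & _TIF_WORK_MASK))
                        prepare_exit_to_usermode(regs);
                /* else: restore registers, return straight to user */
        }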

Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
@ 2016-01-13 10:44         ` Ingo Molnar
  0 siblings, 0 replies; 92+ messages in thread
From: Ingo Molnar @ 2016-01-13 10:44 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel, Mark Rutland


* Chris Metcalf <cmetcalf@ezchip.com> wrote:

> (Adding Mark to cc's)
> 
> On 01/12/2016 05:07 AM, Will Deacon wrote:
> >On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>Ping!  There has been no substantive feedback to this version of
> >>the patch in the week since I posted it, which optimistically suggests
> >>to me that people may be satisfied with it.  If that's true, Frederic,
> >>I assume this would be pulled into your tree?
> >>
> >>I have slightly updated the v9 patch series since this posting:
> >>
> >>[...]
> >>
> >>- Incorporated Mark Rutland's changes to convert arm64
> >>   assembly to C code instead of using my own version.
> >Please avoid queuing these patches -- the first is already in the arm64
> >queue for 4.5 and the second was found to introduce a substantial
> >performance regression on the syscall entry/exit path. I think Mark had
> >an updated version to address that, so it would be easier not to have
> >an old version sitting in some other queue!
> 
> I am not formally queueing them anywhere (like linux-next), though
> now that you mention it, that's a pretty good idea - I'll talk to Steven
> about that, assuming this merge window closes without the task
> isolation stuff going in.

NAK. Given the controversy, no way should this stuff go outside the primary trees 
it affects: the scheduler, timer, irq, etc. trees.

We can merge this up in -tip once everyone is happy... but as I said, don't expect 
many replies before and during the merge window.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 10:44         ` Ingo Molnar
@ 2016-01-13 21:19           ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-13 21:19 UTC (permalink / raw)
  To: Ingo Molnar, Mark Rutland
  Cc: Will Deacon, Gilad Ben Yossef, Steven Rostedt, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Andy Lutomirski, Daniel Lezcano,
	linux-doc, linux-api, linux-kernel

On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> * Chris Metcalf <cmetcalf@ezchip.com> wrote:
>
>> (Adding Mark to cc's)
>>
>> On 01/12/2016 05:07 AM, Will Deacon wrote:
>>> On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
>>>> Ping!  There has been no substantive feedback to this version of
>>>> the patch in the week since I posted it, which optimistically suggests
>>>> to me that people may be satisfied with it.  If that's true, Frederic,
>>>> I assume this would be pulled into your tree?
>>>>
>>>> I have slightly updated the v9 patch series since this posting:
>>>>
>>>> [...]
>>>>
>>>> - Incorporated Mark Rutland's changes to convert arm64
>>>>    assembly to C code instead of using my own version.
>>> Please avoid queuing these patches -- the first is already in the arm64
>>> queue for 4.5 and the second was found to introduce a substantial
>>> performance regression on the syscall entry/exit path. I think Mark had
>>> an updated version to address that, so it would be easier not to have
>>> an old version sitting in some other queue!
>> I am not formally queueing them anywhere (like linux-next), though
>> now that you mention it, that's a pretty good idea - I'll talk to Steven
>> about that, assuming this merge window closes without the task
>> isolation stuff going in.
> NAK. Given the controversy, no way should this stuff go outside the primary trees
> it affects: the scheduler, timer, irq, etc. trees.

Fair enough.  I'll plan to do v10 once the merge window closes.

Mark, let me know when/if you get a new version of the de-asm stuff
for do_notify_resume() - thanks.  Or, would it be helpful if I worked up
the option I suggested, where we check the thread_info flags in the
assembly code before calling out to the new loop in do_notify_resume()?

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-04 19:34   ` Chris Metcalf
  (?)
@ 2016-01-19 15:42   ` Frederic Weisbecker
  2016-01-19 20:45       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 15:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
> diff --git a/kernel/isolation.c b/kernel/isolation.c
> new file mode 100644
> index 000000000000..68a9f7457bc0
> --- /dev/null
> +++ b/kernel/isolation.c
> @@ -0,0 +1,105 @@
> +/*
> + *  linux/kernel/isolation.c
> + *
> + *  Implementation for task isolation.
> + *
> + *  Distributed under GPLv2.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/vmstat.h>
> +#include <linux/isolation.h>
> +#include <linux/syscalls.h>
> +#include "time/tick-sched.h"
> +
> +cpumask_var_t task_isolation_map;
> +
> +/*
> + * Isolation requires both nohz and isolcpus support from the scheduler.
> + * We provide a boot flag that enables both for now, and which we can
> + * add other functionality to over time if needed.  Note that just
> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> + */
> +static int __init task_isolation_setup(char *str)
> +{
> +	alloc_bootmem_cpumask_var(&task_isolation_map);
> +	if (cpulist_parse(str, task_isolation_map) < 0) {
> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> +		return 1;
> +	}
> +
> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
> +
> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> +	tick_nohz_full_running = true;

How about calling tick_nohz_full_setup() instead? I'd prefer that
nohz full implementation details stay in tick-sched.c.
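
Something like this in isolation.c, as a sketch, assuming
tick_nohz_full_setup() is made non-static and given a cpumask-taking
form (the exact signature would be whatever tick-sched.c wants to
export):

        static int __init task_isolation_setup(char *str)
        {
                alloc_bootmem_cpumask_var(&task_isolation_map);
                if (cpulist_parse(str, task_isolation_map) < 0) {
                        pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
                        return 1;
                }

                alloc_bootmem_cpumask_var(&cpu_isolated_map);
                cpumask_copy(cpu_isolated_map, task_isolation_map);

                /* tick-sched.c keeps ownership of tick_nohz_full_mask
                 * and tick_nohz_full_running.
                 */
                tick_nohz_full_setup(task_isolation_map);
                return 1;
        }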

Also, what happens if nohz_full= is given as well as task_isolation=?
Don't we risk a memory leak, and maybe breaking the requirement that
(nohz_full & task_isolation) == task_isolation?

> +
> +	return 1;
> +}
> +__setup("task_isolation=", task_isolation_setup);
> +
> +/*
> + * This routine controls whether we can enable task-isolation mode.
> + * The task must be affinitized to a single task_isolation core or we will
> + * return EINVAL.  Although the application could later re-affinitize
> + * to a housekeeping core and lose task isolation semantics, this
> + * initial test should catch 99% of bugs with task placement prior to
> + * enabling task isolation.
> + */
> +int task_isolation_set(unsigned int flags)
> +{
> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> +	    !task_isolation_possible(smp_processor_id()))
> +		return -EINVAL;
> +
> +	current->task_isolation_flags = flags;
> +	return 0;
> +}

What if we concurrently change the task's affinity? Also it seems that preemption
isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
seen warnings with smp_processor_id().

Also we should protect against task's affinity change when task_isolation_flags
is set.

> +
> +/*
> + * In task isolation mode we try to return to userspace only after
> + * attempting to make sure we won't be interrupted again.  To handle
> + * the periodic scheduler tick, we test to make sure that the tick is
> + * stopped, and if it isn't yet, we request a reschedule so that if
> + * another task needs to run to completion first, it can do so.
> + * Similarly, if any other subsystems require quiescing, we will need
> + * to do that before we return to userspace.
> + */
> +bool _task_isolation_ready(void)
> +{
> +	WARN_ON_ONCE(!irqs_disabled());
> +
> +	/* If we need to drain the LRU cache, we're not ready. */
> +	if (lru_add_drain_needed(smp_processor_id()))
> +		return false;
> +
> +	/* If vmstats need updating, we're not ready. */
> +	if (!vmstat_idle())
> +		return false;
> +
> +	/* Request rescheduling unless we are in full dynticks mode. */
> +	if (!tick_nohz_tick_stopped()) {
> +		set_tsk_need_resched(current);

I'm not sure doing this will help getting the tick to get stopped.

> +		return false;
> +	}
> +
> +	return true;
> +}

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 15:42   ` Frederic Weisbecker
@ 2016-01-19 20:45       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-19 20:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> On Mon, Jan 04, 2016 at 02:34:42PM -0500, Chris Metcalf wrote:
>> diff --git a/kernel/isolation.c b/kernel/isolation.c
>> new file mode 100644
>> index 000000000000..68a9f7457bc0
>> --- /dev/null
>> +++ b/kernel/isolation.c
>> @@ -0,0 +1,105 @@
>> +/*
>> + *  linux/kernel/isolation.c
>> + *
>> + *  Implementation for task isolation.
>> + *
>> + *  Distributed under GPLv2.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/swap.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/isolation.h>
>> +#include <linux/syscalls.h>
>> +#include "time/tick-sched.h"
>> +
>> +cpumask_var_t task_isolation_map;
>> +
>> +/*
>> + * Isolation requires both nohz and isolcpus support from the scheduler.
>> + * We provide a boot flag that enables both for now, and which we can
>> + * add other functionality to over time if needed.  Note that just
>> + * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
>> + */
>> +static int __init task_isolation_setup(char *str)
>> +{
>> +	alloc_bootmem_cpumask_var(&task_isolation_map);
>> +	if (cpulist_parse(str, task_isolation_map) < 0) {
>> +		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
>> +		return 1;
>> +	}
>> +
>> +	alloc_bootmem_cpumask_var(&cpu_isolated_map);
>> +	cpumask_copy(cpu_isolated_map, task_isolation_map);
>> +
>> +	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
>> +	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
>> +	tick_nohz_full_running = true;
> How about calling tick_nohz_full_setup() instead? I'd prefer that
> nohz full implementation details stay in tick-sched.c.
>
> Also, what happens if nohz_full= is given as well as task_isolation=?
> Don't we risk a memory leak, and maybe breaking the requirement that
> (nohz_full & task_isolation) == task_isolation?

Yeah, this is a good point.  I'm not sure what the best way is to make
this happen.  It's already true that we will leak memory if you
specify "nohz_full=" more than once on the command line, but it's
awkward to fix (assuming we want the last value to win) so maybe
we can just ignore this problem - it's a pretty small amount of memory
after all.  If so, then making tick_nohz_full_setup() and
isolated_cpu_setup() both non-static and calling them from
task_isolation_setup() might be the cleanest approach.  What do you think?

You asked what happens if nohz_full= is given as well, which is a very
good question.  Perhaps the right answer is to have an early_initcall
that suppresses task isolation on any cores that lost their nohz_full
or isolcpus status due to later boot command line arguments (and
generate a console warning, obviously).
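
Roughly this kind of thing, as a sketch (the helper name and the exact
checks are invented here):

        /* Late in boot, after all command-line parsing: drop (and warn
         * about) any task-isolation cpu that a later nohz_full= or
         * isolcpus= argument pushed out of the required masks.
         */
        static int __init task_isolation_check(void)
        {
                int cpu;

                for_each_cpu(cpu, task_isolation_map) {
                        if (!tick_nohz_full_cpu(cpu) ||
                            !cpumask_test_cpu(cpu, cpu_isolated_map)) {
                                pr_warn("task_isolation: disabling on cpu %d\n",
                                        cpu);
                                cpumask_clear_cpu(cpu, task_isolation_map);
                        }
                }
                return 0;
        }
        early_initcall(task_isolation_check);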

>> +
>> +	return 1;
>> +}
>> +__setup("task_isolation=", task_isolation_setup);
>> +
>> +/*
>> + * This routine controls whether we can enable task-isolation mode.
>> + * The task must be affinitized to a single task_isolation core or we will
>> + * return EINVAL.  Although the application could later re-affinitize
>> + * to a housekeeping core and lose task isolation semantics, this
>> + * initial test should catch 99% of bugs with task placement prior to
>> + * enabling task isolation.
>> + */
>> +int task_isolation_set(unsigned int flags)
>> +{
>> +	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
>> +	    !task_isolation_possible(smp_processor_id()))
>> +		return -EINVAL;
>> +
>> +	current->task_isolation_flags = flags;
>> +	return 0;
>> +}
> What if we concurrently change the task's affinity? Also it seems that preemption
> isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> seen warnings with smp_processor_id().
>
> Also we should protect against task's affinity change when task_isolation_flags
> is set.

I talked about this a bit when you raised it for the v8 patch series:

   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com

I'd be curious to hear your take on the arguments I made there.

You're absolutely right about the preemption warnings, which I only fixed
a few days ago.  In this case I use raw_smp_processor_id() since with a
fixed single-core cpu affinity, we're not going anywhere, so the warning
from smp_processor_id() would be bogus.  And although technically it is
still correct (racing with another task resetting the task affinity on this
one), it is in any case equivalent to having that other task reset the
affinity on return from the prctl(), which I've already claimed isn't an
interesting use case to try to handle.  But let me know what you think!

>> +
>> +/*
>> + * In task isolation mode we try to return to userspace only after
>> + * attempting to make sure we won't be interrupted again.  To handle
>> + * the periodic scheduler tick, we test to make sure that the tick is
>> + * stopped, and if it isn't yet, we request a reschedule so that if
>> + * another task needs to run to completion first, it can do so.
>> + * Similarly, if any other subsystems require quiescing, we will need
>> + * to do that before we return to userspace.
>> + */
>> +bool _task_isolation_ready(void)
>> +{
>> +	WARN_ON_ONCE(!irqs_disabled());
>> +
>> +	/* If we need to drain the LRU cache, we're not ready. */
>> +	if (lru_add_drain_needed(smp_processor_id()))
>> +		return false;
>> +
>> +	/* If vmstats need updating, we're not ready. */
>> +	if (!vmstat_idle())
>> +		return false;
>> +
>> +	/* Request rescheduling unless we are in full dynticks mode. */
>> +	if (!tick_nohz_tick_stopped()) {
>> +		set_tsk_need_resched(current);
> I'm not sure doing this will help getting the tick to get stopped.

Well, I don't know that there is anything else we CAN do, right?  If there's
another task that can run, great - it may be that that's why full dynticks
isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
there's nothing else we can do, in which case we basically spend our time
going around through the scheduler code and back out to the
task_isolation_ready() test, but again, there's really nothing else more
useful we can be doing at this point.  Once the RCU tick fires (or whatever
it was that was preventing full dynticks from engaging), we will pass this
test and return to user space.
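
For reference, the shape of the exit-to-usermode loop in question, as a
heavily simplified sketch (do_signal() stands in for the arch's signal
hook, and the work mask is abbreviated); setting TIF_NEED_RESCHED just
makes the next pass around this loop call schedule():

        static void prepare_exit_to_usermode(struct pt_regs *regs)
        {
                struct thread_info *ti = current_thread_info();

                while (true) {
                        unsigned long flags = READ_ONCE(ti->flags);

                        if (flags & _TIF_NEED_RESCHED)
                                schedule();
                        if (flags & _TIF_SIGPENDING)
                                do_signal(regs);

                        /* Leave only when no work bits are set and the
                         * core is quiesced: tick stopped, lru drained,
                         * vmstat idle.
                         */
                        if (!(READ_ONCE(ti->flags) & _TIF_WORK_MASK) &&
                            task_isolation_ready())
                                break;
                }
        }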

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 00/13] support "task_isolation" mode for nohz_full
  2016-01-13 21:19           ` Chris Metcalf
  (?)
@ 2016-01-20 13:27           ` Mark Rutland
  -1 siblings, 0 replies; 92+ messages in thread
From: Mark Rutland @ 2016-01-20 13:27 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Ingo Molnar, Will Deacon, Gilad Ben Yossef, Steven Rostedt,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
	Christoph Lameter, Viresh Kumar, Catalin Marinas,
	Andy Lutomirski, Daniel Lezcano, linux-doc, linux-api,
	linux-kernel

Hi Chris,

Sorry for the delay. I had intended to take a look at this and so held
off replying, but my time has been taken up elsewhere.

On Wed, Jan 13, 2016 at 04:19:56PM -0500, Chris Metcalf wrote:
> On 01/13/2016 05:44 AM, Ingo Molnar wrote:
> >* Chris Metcalf <cmetcalf@ezchip.com> wrote:
> >
> >>(Adding Mark to cc's)
> >>
> >>On 01/12/2016 05:07 AM, Will Deacon wrote:
> >>>On Mon, Jan 11, 2016 at 04:15:50PM -0500, Chris Metcalf wrote:
> >>>>Ping!  There has been no substantive feedback to this version of
> >>>>the patch in the week since I posted it, which optimistically suggests
> >>>>to me that people may be satisfied with it.  If that's true, Frederic,
> >>>>I assume this would be pulled into your tree?
> >>>>
> >>>>I have slightly updated the v9 patch series since this posting:
> >>>>
> >>>>[...]
> >>>>
> >>>>- Incorporated Mark Rutland's changes to convert arm64
> >>>>   assembly to C code instead of using my own version.
> >>>Please avoid queuing these patches -- the first is already in the arm64
> >>>queue for 4.5 and the second was found to introduce a substantial
> >>>performance regression on the syscall entry/exit path. I think Mark had
> >>>an updated version to address that, so it would be easier not to have
> >>>an old version sitting in some other queue!
> >>I am not formally queueing them anywhere (like linux-next), though
> >>now that you mention it, that's a pretty good idea - I'll talk to Steven
> >>about that, assuming this merge window closes without the task
> >>isolation stuff going in.
> >NAK. Given the controversy, no way should this stuff go outside the primary trees
> >it affects: the scheduler, timer, irq, etc. trees.
> 
> Fair enough.  I'll plan to do v10 once the merge window closes.
> 
> Mark, let me know when/if you get a new version of the de-asm stuff
> for do_notify_resume() - thanks.

If I get the chance soon, I will do, though I suspect I won't be able
to give that the time it deserves over the next week or two.

> Or, would it be helpful if I worked up the option I suggested, where
> we check the thread_info flags in the assembly code before calling out
> to the new loop in do_notify_resume()?

That would probably be for the best.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-19 20:45       ` Chris Metcalf
  (?)
@ 2016-01-28  0:28       ` Frederic Weisbecker
  2016-01-29 18:18           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-28  0:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> >>+/*
> >>+ * Isolation requires both nohz and isolcpus support from the scheduler.
> >>+ * We provide a boot flag that enables both for now, and which we can
> >>+ * add other functionality to over time if needed.  Note that just
> >>+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> >>+ */
> >>+static int __init task_isolation_setup(char *str)
> >>+{
> >>+	alloc_bootmem_cpumask_var(&task_isolation_map);
> >>+	if (cpulist_parse(str, task_isolation_map) < 0) {
> >>+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> >>+		return 1;
> >>+	}
> >>+
> >>+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> >>+	cpumask_copy(cpu_isolated_map, task_isolation_map);
> >>+
> >>+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> >>+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> >>+	tick_nohz_full_running = true;
> >How about calling tick_nohz_full_setup() instead? I'd prefer that
> >nohz full implementation details stay in tick-sched.c.
> >
> >Also, what happens if nohz_full= is given as well as task_isolation=?
> >Don't we risk a memory leak, and maybe breaking the requirement that
> >(nohz_full & task_isolation) == task_isolation?
> 
> Yeah, this is a good point.  I'm not sure what the best way is to make
> this happen.  It's already true that we will leak memory if you
> specify "nohz_full=" more than once on the command line, but it's
> awkward to fix (assuming we want the last value to win) so maybe
> we can just ignore this problem - it's a pretty small amount of memory
> after all.  If so, then making tick_nohz_full_setup() and
> isolated_cpu_setup()
> both non-static and calling them from task_isolation_setup() might
> be the cleanest approach.  What do you think?

I think we can reuse tick_nohz_full_setup() indeed, or some of its internals
and encapsulate that in a function so that isolation.c can initialize nohz full
without fiddling with internal variables.

> 
> You asked what happens if nohz_full= is given as well, which is a very
> good question.  Perhaps the right answer is to have an early_initcall
> that suppresses task isolation on any cores that lost their nohz_full
> or isolcpus status due to later boot command line arguments (and
> generate a console warning, obviously).

I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
That's the easiest way to deal with it, and both nohz and task isolation can call
a common initializer that takes care of the allocation and adds the cpus to the mask.
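
For instance, as a sketch (the name is invented here, and the NULL test
assumes CPUMASK_OFFSTACK):

        /* One shared initializer: both "nohz_full=" and
         * "task_isolation=" call this, so the final mask is the union
         * of the two and is only allocated once.
         */
        void __init tick_nohz_full_add_cpus(const struct cpumask *cpumask)
        {
                if (cpumask_empty(cpumask))
                        return;

                if (tick_nohz_full_mask == NULL)
                        alloc_bootmem_cpumask_var(&tick_nohz_full_mask);

                cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, cpumask);
                tick_nohz_full_running = true;
        }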

> >>+int task_isolation_set(unsigned int flags)
> >>+{
> >>+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> >>+	    !task_isolation_possible(smp_processor_id()))
> >>+		return -EINVAL;
> >>+
> >>+	current->task_isolation_flags = flags;
> >>+	return 0;
> >>+}
> >What if we concurrently change the task's affinity? Also it seems that preemption
> >isn't disabled, so we can also migrate concurrently. I'm surprised you haven't
> >seen warnings with smp_processor_id().
> >
> >Also we should protect against task's affinity change when task_isolation_flags
> >is set.
> 
> I talked about this a bit when you raised it for the v8 patch series:
> 
>   http://lkml.kernel.org/r/562FA8FD.8080502@ezchip.com
> 
> I'd be curious to hear your take on the arguments I made there.

Oh ok, I'm going to reply there then :)

> 
> You're absolutely right about the preemption warnings, which I only fixed
> a few days ago.  In this case I use raw_smp_processor_id() since with a
> fixed single-core cpu affinity, we're not going anywhere, so the warning
> from smp_processor_id() would be bogus.  And although technically it is
> still correct (racing with another task resetting the task affinity on this
> one), it is in any case equivalent to having that other task reset the
> affinity
> on return from the prctl(), which I've already claimed isn't an interesting
> use case to try to handle.  But let me know what you think!

Ok it's very much tied to the affinity issue. If we deal with affinity changes
properly I think we can use the raw_ version.

> 
> >>+
> >>+/*
> >>+ * In task isolation mode we try to return to userspace only after
> >>+ * attempting to make sure we won't be interrupted again.  To handle
> >>+ * the periodic scheduler tick, we test to make sure that the tick is
> >>+ * stopped, and if it isn't yet, we request a reschedule so that if
> >>+ * another task needs to run to completion first, it can do so.
> >>+ * Similarly, if any other subsystems require quiescing, we will need
> >>+ * to do that before we return to userspace.
> >>+ */
> >>+bool _task_isolation_ready(void)
> >>+{
> >>+	WARN_ON_ONCE(!irqs_disabled());
> >>+
> >>+	/* If we need to drain the LRU cache, we're not ready. */
> >>+	if (lru_add_drain_needed(smp_processor_id()))
> >>+		return false;
> >>+
> >>+	/* If vmstats need updating, we're not ready. */
> >>+	if (!vmstat_idle())
> >>+		return false;
> >>+
> >>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>+	if (!tick_nohz_tick_stopped()) {
> >>+		set_tsk_need_resched(current);
> >I'm not sure doing this will help getting the tick to get stopped.
> 
> Well, I don't know that there is anything else we CAN do, right?  If there's
> another task that can run, great - it may be that that's why full dynticks
> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> there's nothing else we can do, in which case we basically spend our time
> going around through the scheduler code and back out to the
> task_isolation_ready() test, but again, there's really nothing else more
> useful we can be doing at this point.  Once the RCU tick fires (or whatever
> it was that was preventing full dynticks from engaging), we will pass this
> test and return to user space.

There is nothing at all you can do and setting TIF_RESCHED won't help either.
If there is another task that can run, the scheduler takes care of resched
by itself :-)

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-28  0:28       ` Frederic Weisbecker
@ 2016-01-29 18:18           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-01-29 18:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>> You asked what happens if nohz_full= is given as well, which is a very
>> good question.  Perhaps the right answer is to have an early_initcall
>> that suppresses task isolation on any cores that lost their nohz_full
>> or isolcpus status due to later boot command line arguments (and
>> generate a console warning, obviously).
> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
> That's the easiest way to deal with it, and both nohz and task isolation can call
> a common initializer that takes care of the allocation and adds the cpus to the mask.

I like it!

And by the same token, the final isolcpus cpumask is
"isolcpus=" | "task_isolation="?  That seems like we'd want to do it
to keep things parallel.

>>>> +bool _task_isolation_ready(void)
>>>> +{
>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>> +
>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>> +		return false;
>>>> +
>>>> +	/* If vmstats need updating, we're not ready. */
>>>> +	if (!vmstat_idle())
>>>> +		return false;
>>>> +
>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>> +	if (!tick_nohz_tick_stopped()) {
>>>> +		set_tsk_need_resched(current);
>>> I'm not sure doing this will help getting the tick to get stopped.
>> Well, I don't know that there is anything else we CAN do, right?  If there's
>> another task that can run, great - it may be that that's why full dynticks
>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>> there's nothing else we can do, in which case we basically spend our time
>> going around through the scheduler code and back out to the
>> task_isolation_ready() test, but again, there's really nothing else more
>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>> it was that was preventing full dynticks from engaging), we will pass this
>> test and return to user space.
> There is nothing at all you can do and setting TIF_RESCHED won't help either.
> If there is another task that can run, the scheduler takes care of resched
> by itself :-)

The problem is that the scheduler will only take care of resched at a
later time, typically when we get a timer interrupt later.  By invoking the
scheduler here, we allow any tasks that are ready to run to run
immediately, rather than waiting for an interrupt to wake the scheduler.
Plenty of places in the kernel just call schedule() directly when they are
waiting.  Since we're waiting here regardless, we might as well
immediately get any other runnable tasks dealt with.

We could also just return "false" in _task_isolation_ready(), and then
check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
call schedule() explicitly there, but that seems a little more roundabout.
Admittedly it's more usual to see kernel code call schedule() directly
to yield the processor, but in this case I'm not convinced it's cleaner
given we're already in a loop where the caller is checking TIF_RESCHED
and then calling schedule() when it's set.
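
That roundabout variant would look something like this sketch:

        /* Do the quiescing work in _task_isolation_enter() and yield
         * directly when the tick is still running, instead of setting
         * TIF_NEED_RESCHED and letting the caller's loop do it.
         */
        void _task_isolation_enter(void)
        {
                lru_add_drain();
                quiet_vmstat();
                if (!tick_nohz_tick_stopped())
                        schedule();
        }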

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-01-30 21:11             ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-01-30 21:11 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> >On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> >>You asked what happens if nohz_full= is given as well, which is a very
> >>good question.  Perhaps the right answer is to have an early_initcall
> >>that suppresses task isolation on any cores that lost their nohz_full
> >>or isolcpus status due to later boot command line arguments (and
> >>generate a console warning, obviously).
> >I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
> >That's the easiest way to deal with it, and both nohz and task isolation can call
> >a common initializer that takes care of the allocation and adds the cpus to the mask.
> 
> I like it!
> 
> And by the same token, the final isolcpus cpumask is "isolcpus=" |
> "task_isolation="?
> That seems like we'd want to do it to keep things parallel.

We have reverted the patch that made isolcpus |= nohz_full.  Too
many people complained about unusable machines with NO_HZ_FULL_ALL.

But the user can still set that parameter manually.

> 
> >>>>+bool _task_isolation_ready(void)
> >>>>+{
> >>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>+
> >>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>+		return false;
> >>>>+
> >>>>+	/* If vmstats need updating, we're not ready. */
> >>>>+	if (!vmstat_idle())
> >>>>+		return false;
> >>>>+
> >>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>+		set_tsk_need_resched(current);
> >>>I'm not sure doing this will help getting the tick to get stopped.
> >>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>another task that can run, great - it may be that that's why full dynticks
> >>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>there's nothing else we can do, in which case we basically spend our time
> >>going around through the scheduler code and back out to the
> >>task_isolation_ready() test, but again, there's really nothing else more
> >>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>it was that was preventing full dynticks from engaging), we will pass this
> >>test and return to user space.
> >There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >If there is another task that can run, the scheduler takes care of resched
> >by itself :-)
> 
> The problem is that the scheduler will only take care of resched at a
> later time, typically when we get a timer interrupt later.

When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
target is remote it sends an IPI, if it's local then we wait the next reschedule
point (preemption points, voluntary reschedule, interrupts). There is just nothing
you can do to accelerate that.
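
This is essentially resched_curr() in kernel/sched/core.c, trimmed down:

        /* Mark the target's TIF_NEED_RESCHED; only send the scheduler
         * IPI when the target cpu is remote and not already polling
         * the flag.
         */
        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
        } else if (set_nr_and_not_polling(curr)) {
                smp_send_reschedule(cpu);
        }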


> By invoking the scheduler here, we allow any tasks that are ready to run to run
> immediately, rather than waiting for an interrupt to wake the scheduler.

Well, in this case here we are interested in the current CPU. And if a task
got awoken and waits for the current CPU, it will have an opportunity to get
scheduled on syscall exit.

> Plenty of places in the kernel just call schedule() directly when they are
> waiting.  Since we're waiting here regardless, we might as well
> immediately get any other runnable tasks dealt with.
> 
> We could also just return "false" in _task_isolation_ready(), and then
> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> call schedule() explicitly there, but that seems a little more roundabout.
> Admittedly it's more usual to see kernel code call schedule() directly
> to yield the processor, but in this case I'm not convinced it's cleaner
> given we're already in a loop where the caller is checking TIF_RESCHED
> and then calling schedule() when it's set.

You could call cond_resched(), but really syscall exit is enough for what
you want. And the problem here, if a task prevents the CPU from stopping the
tick, is that task itself, not the fact that it doesn't get scheduled. If we have
tasks other than the current isolated one on the CPU, it means that the
environment is not ready for hard isolation.

And in general we shouldn't loop at all there: if something depends on the tick,
the CPU is not ready for isolation and something needs to be done first (setting
some task affinity, etc.). So we should just fail the prctl and let the user
deal with it.
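
In userspace that would look something like this (a sketch only:
PR_SET_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE are the names this
series proposes, and the -EAGAIN/-EBUSY retry is an assumption about how
the failure would be reported):

    #include <sys/prctl.h>
    #include <errno.h>
    #include <unistd.h>

    static void enter_isolation(void)
    {
        /* Retry until the kernel agrees the cpu is quiesced. */
        while (prctl(PR_SET_TASK_ISOLATION,
                     PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
            if (errno != EAGAIN && errno != EBUSY)
                break;          /* hard failure: fix affinity etc. */
            usleep(1000);       /* give pending work a chance to drain */
        }
    }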

> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-01-30 21:11             ` Frederic Weisbecker
@ 2016-02-11 19:24               ` Chris Metcalf
  -1 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-02-11 19:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
>> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
>>> On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
>>>> You asked what happens if nohz_full= is given as well, which is a very
>>>> good question.  Perhaps the right answer is to have an early_initcall
>>>> that suppresses task isolation on any cores that lost their nohz_full
>>>> or isolcpus status due to later boot command line arguments (and
>>>> generate a console warning, obviously).
>>> I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation="
>>> That's the easiest way to deal with and both nohz and task isolation can call
>>> a common initializer that takes care of the allocation and add the cpus to the mask.
>> I like it!
>>
>> And by the same token, the final isolcpus cpumask is "isolcpus=" |
>> "task_isolation="?
>> That seems like we'd want to do it to keep things parallel.
> We have reverted the patch that made isolcpus |= nohz_full. Too
> many people complained about unusable machines with NO_HZ_FULL_ALL.
>
> But the user can still set that parameter manually.

Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
we should add cpus X-Y to both the nohz_full set and the isolcpus set.
I've changed it to work that way for the v10 patch series.


>>>>>> +bool _task_isolation_ready(void)
>>>>>> +{
>>>>>> +	WARN_ON_ONCE(!irqs_disabled());
>>>>>> +
>>>>>> +	/* If we need to drain the LRU cache, we're not ready. */
>>>>>> +	if (lru_add_drain_needed(smp_processor_id()))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* If vmstats need updating, we're not ready. */
>>>>>> +	if (!vmstat_idle())
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* Request rescheduling unless we are in full dynticks mode. */
>>>>>> +	if (!tick_nohz_tick_stopped()) {
>>>>>> +		set_tsk_need_resched(current);
>>>>> I'm not sure doing this will help getting the tick to get stopped.
>>>> Well, I don't know that there is anything else we CAN do, right?  If there's
>>>> another task that can run, great - it may be that that's why full dynticks
>>>> isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
>>>> there's nothing else we can do, in which case we basically spend our time
>>>> going around through the scheduler code and back out to the
>>>> task_isolation_ready() test, but again, there's really nothing else more
>>>> useful we can be doing at this point.  Once the RCU tick fires (or whatever
>>>> it was that was preventing full dynticks from engaging), we will pass this
>>>> test and return to user space.
>>> There is nothing at all you can do and setting TIF_RESCHED won't help either.
>>> If there is another task that can run, the scheduler takes care of resched
>>> by itself :-)
>> The problem is that the scheduler will only take care of resched at a
>> later time, typically when we get a timer interrupt later.
> When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> target is remote it sends an IPI; if it's local, we wait for the next reschedule
> point (preemption points, voluntary reschedules, interrupts). There is just nothing
> you can do to accelerate that.

But that's exactly what I'm saying.  If we're sitting in a loop here waiting
for some short-lived process (maybe kernel thread) to run and get out of
the way, we don't want to just spin sitting in prepare_exit_to_usermode().
We want to call schedule(), get the short-lived process to run, then when
it calls schedule() again, we're back in prepare_exit_to_usermode but now
we can return to userspace.

We don't want to wait for preemption points or interrupts, and there are
no other voluntary reschedules in the prepare_exit_to_usermode() loop.

If the other task had been woken up for some completion, then yes we would
already have had TIF_RESCHED set, but if the other runnable task was (for
example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
this point, and thus we might need to call schedule() explicitly.

Note that the prepare_exit_to_usermode() loop is exactly the point at
which we normally call schedule() if we are in syscall exit, so we are
just encouraging that schedule() to happen if otherwise it might not.
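
For context, that loop looks roughly like this in the 4.4-era
arch/x86/entry/common.c (condensed here, not quoted verbatim):

    static void prepare_exit_to_usermode(struct pt_regs *regs)
    {
        while (true) {
            u32 cached_flags = READ_ONCE(current_thread_info()->flags);

            /* Nothing pending?  Then we can return to userspace. */
            if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                                  _TIF_UPROBE | _TIF_NEED_RESCHED)))
                break;

            local_irq_enable();

            if (cached_flags & _TIF_NEED_RESCHED)
                schedule();     /* the schedule() call at issue here */

            /* ... uprobes, signal delivery, notify-resume work ... */

            local_irq_disable();
        }
    }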

>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>> immediately, rather than waiting for an interrupt to wake the scheduler.
> Well, in this case here we are interested in the current CPU. And if a task
> got awoken and waits for the current CPU, it will have an opportunity to get
> scheduled on syscall exit.

That's true if TIF_RESCHED was set because a completion occurred that
the other task was waiting for.  But there might not be any such completion
and the task just got preempted earlier and is still ready to run.

My point is that setting TIF_RESCHED is never harmful, and there are
cases like involuntary preemption where it might help.


>> Plenty of places in the kernel just call schedule() directly when they are
>> waiting.  Since we're waiting here regardless, we might as well
>> immediately get any other runnable tasks dealt with.
>>
>> We could also just return "false" in _task_isolation_ready(), and then
>> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
>> call schedule() explicitly there, but that seems a little more roundabout.
>> Admittedly it's more usual to see kernel code call schedule() directly
>> to yield the processor, but in this case I'm not convinced it's cleaner
>> given we're already in a loop where the caller is checking TIF_RESCHED
>> and then calling schedule() when it's set.
> You could call cond_resched(), but really syscall exit is enough for what
> you want. And the problem here, if a task prevents the CPU from stopping the
> tick, is that task itself, not the fact that it doesn't get scheduled.

True, although in that case we just need to wait (e.g. for an RCU tick
to occur to quiesce); we could spin, but spinning through the scheduler
seems no better or worse in that case than just spinning with
interrupts enabled in a loop.  And (as I said above) it could help.

> If we have
> tasks other than the current isolated one on the CPU, it means that the
> environment is not ready for hard isolation.

Right.  But the model is that in that case, the task that wants hard
isolation is just going to have to wait to return to userspace.


> And in general we shouldn't loop at all there: if something depends on the tick,
> the CPU is not ready for isolation and something needs to be done first (setting
> some task affinity, etc.). So we should just fail the prctl and let the user
> deal with it.

So there are potentially two cases here:

(1) When we initially do the prctl(), should we check to see if there are
other schedulable tasks, etc., and fail the prctl() if so?  You could make a
case for this, but I think in practice userspace would just end up looping
back to retry the prctl if we created that semantic in the kernel.

(2) What about times when we are leaving the kernel after already
doing the prctl()?  For example a core doing packet forwarding might
want to report some error condition up to the kernel, and remove itself
from the set of cores handling packets, then do some syscall(s) to generate
logging data, and then go back and continue handling packets.  Or, the
process might have created some large anonymous mapping where
every now and then it needs to cross a page boundary for some structure
and touch a new page, and it knows to expect a page fault in that case.
In those cases we are returning from the kernel, not at prctl() time, and
we still want to enforce the semantics that no further interrupts will
occur to disturb the task.  These kinds of use cases are why we have
as general-purpose a mechanism as we do for task isolation.
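
To make case (2) concrete, the usage pattern is something like the
following sketch (handle_packets(), log_fd, err_msg, etc. are
hypothetical application pieces; the prctl names are the ones this
series proposes):

    /* One-time setup, after pinning to an isolated core. */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

    for (;;) {
        if (handle_packets() < 0) {     /* hypothetical */
            /* Deliberate kernel entry: on return, the kernel holds
             * us until the core quiesces, then we come back out
             * still in task isolation mode. */
            write(log_fd, err_msg, err_len);
        }
    }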

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-02-11 19:24               ` Chris Metcalf
@ 2016-03-04 12:56               ` Frederic Weisbecker
  2016-03-09 19:39                   ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-03-04 12:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
> On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> >We have reverted the patch that made isolcpus |= nohz_full. Too
> >many people complained about unusable machines with NO_HZ_FULL_ALL.
> >
> >But the user can still set that parameter manually.
> 
> Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
> we should add cpus X-Y to both the nohz_full set and the isolcpus set.
> I've changed it to work that way for the v10 patch series.

Ok.

> 
> 
> >>>>>>+bool _task_isolation_ready(void)
> >>>>>>+{
> >>>>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>>>+
> >>>>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* If vmstats need updating, we're not ready. */
> >>>>>>+	if (!vmstat_idle())
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>>>+		set_tsk_need_resched(current);
> >>>>>I'm not sure doing this will help getting the tick to get stopped.
> >>>>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>>>another task that can run, great - it may be that that's why full dynticks
> >>>>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>>>there's nothing else we can do, in which case we basically spend our time
> >>>>going around through the scheduler code and back out to the
> >>>>task_isolation_ready() test, but again, there's really nothing else more
> >>>>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>>>it was that was preventing full dynticks from engaging), we will pass this
> >>>>test and return to user space.
> >>>There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >>>If there is another task that can run, the scheduler takes care of resched
> >>>by itself :-)
> >>The problem is that the scheduler will only take care of resched at a
> >>later time, typically when we get a timer interrupt later.
> >When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> >target is remote it sends an IPI; if it's local, we wait for the next reschedule
> >point (preemption points, voluntary reschedules, interrupts). There is just nothing
> >you can do to accelerate that.
> 
> But that's exactly what I'm saying.  If we're sitting in a loop here waiting
> for some short-lived process (maybe kernel thread) to run and get out of
> the way, we don't want to just spin sitting in prepare_exit_to_usermode().
> We want to call schedule(), get the short-lived process to run, then when
> it calls schedule() again, we're back in prepare_exit_to_usermode but now
> we can return to userspace.

Maybe, although I think returning to userspace with -EAGAIN or -EBUSY or something
like that would be better, so that userspace retries a bit later with prctl.
Otherwise we may well be waiting forever in kernel mode.

> 
> We don't want to wait for preemption points or interrupts, and there are
> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
> 
> If the other task had been woken up for some completion, then yes we would
> already have had TIF_RESCHED set, but if the other runnable task was (for
> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
> this point, and thus we might need to call schedule() explicitly.

There can't be another task in the runqueue waiting to be preempted since
we (the current task) are running on the CPU.

Besides, if we aren't alone in the runqueue, this breaks the task isolation
mode.

> 
> Note that the prepare_exit_to_usermode() loop is exactly the point at
> which we normally call schedule() if we are in syscall exit, so we are
> just encouraging that schedule() to happen if otherwise it might not.
> 
> >>By invoking the scheduler here, we allow any tasks that are ready to run to run
> >>immediately, rather than waiting for an interrupt to wake the scheduler.
> >Well, in this case here we are interested in the current CPU. And if a task
> >got awoken and waits for the current CPU, it will have an opportunity to get
> >scheduled on syscall exit.
> 
> That's true if TIF_RESCHED was set because a completion occurred that
> the other task was waiting for.  But there might not be any such completion
> and the task just got preempted earlier and is still ready to run.

But if another task waits for the CPU, this breaks task isolation mode. Now,
assuming we want a pending task to resume so that we get the CPU to ourselves,
we have no idea if the scheduler is going to schedule that task; it depends on
vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
doesn't guarantee any context switch.

> My point is that setting TIF_RESCHED is never harmful, and there are
> cases like involuntary preemption where it might help.

Sure but we don't write code just because it doesn't harm. Strange code hurts
the brain of reviewers.

Now concerning involuntary preemption, it's a matter of a millisecond; userspace
needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
what can be useful, as we leave the CPU for the resuming task.

Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
correctly or there is a kernel thread that might run again at some time ahead.

> 
> >>Plenty of places in the kernel just call schedule() directly when they are
> >>waiting.  Since we're waiting here regardless, we might as well
> >>immediately get any other runnable tasks dealt with.
> >>
> >>We could also just return "false" in _task_isolation_ready(), and then
> >>check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> >>call schedule() explicitly there, but that seems a little more roundabout.
> >>Admittedly it's more usual to see kernel code call schedule() directly
> >>to yield the processor, but in this case I'm not convinced it's cleaner
> >>given we're already in a loop where the caller is checking TIF_RESCHED
> >>and then calling schedule() when it's set.
> >You could call cond_resched(), but really syscall exit is enough for what
> >you want. And the problem here, if a task prevents the CPU from stopping the
> >tick, is that task itself, not the fact that it doesn't get scheduled.
> 
> True, although in that case we just need to wait (e.g. for an RCU tick
> to occur to quiesce); we could spin, but spinning through the scheduler
> seems no better or worse in that case than just spinning with
> interrupts enabled in a loop.  And (as I said above) it could help.

Let's just leave that waiting to userspace. Just sleep a few milliseconds.

> 
> >If we have
> >tasks other than the current isolated one on the CPU, it means that the
> >environment is not ready for hard isolation.
> 
> Right.  But the model is that in that case, the task that wants hard
> isolation is just going to have to wait to return to userspace.

I think we shouldn't do that waiting for isolation in the kernel.

> 
> 
> >And in general we shouldn't loop at all there: if something depends on the tick,
> >the CPU is not ready for isolation and something needs to be done first (setting
> >some task affinity, etc.). So we should just fail the prctl and let the user
> >deal with it.
> 
> So there are potentially two cases here:
> 
> (1) When we initially do the prctl(), should we check to see if there are
> other schedulable tasks, etc., and fail the prctl() if so?  You could make a
> case for this, but I think in practice userspace would just end up looping
> back to retry the prctl if we created that semantic in the kernel.

That sounds saner to me. And if we still fail after one second, then just give up.
In fact, if it doesn't work the first time, that's a bad sign, like I said above:
the task that is running on the CPU may well come back again later. Some
preconditions are not met.

> 
> (2) What about times when we are leaving the kernel after already
> doing the prctl()?  For example a core doing packet forwarding might
> want to report some error condition up to the kernel, and remove itself
> from the set of cores handling packets, then do some syscall(s) to generate
> logging data, and then go back and continue handling packets.  Or, the
> process might have created some large anonymous mapping where
> every now and then it needs to cross a page boundary for some structure
> and touch a new page, and it knows to expect a page fault in that case.
> In those cases we are returning from the kernel, not at prctl() time, and
> we still want to enforce the semantics that no further interrupts will
> occur to disturb the task.  These kinds of use cases are why we have
> as general-purpose a mechanism as we do for task isolation.

If any interrupt or any kind of disturbance happens, we should leave that
task isolation mode and warn the isolated task about that. SIGTERM?

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-04 12:56               ` Frederic Weisbecker
@ 2016-03-09 19:39                   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-03-09 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

Frederic,

Thanks for the detailed feedback on the task isolation stuff.

This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.


   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.

There are a number of pros and cons to the two models.  I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:

PRO for persistent mode: A somewhat easier programming model.  Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously.  For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode.  This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc.  With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running).  By contrast, in
one-shot mode, any return to kernel space turns off task isolation
anyway, so it's very clear what the interaction looks like.  I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc.  Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).
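
The selection might then look like this (sketch; PR_TASK_ISOLATION_ONESHOT
is the proposed new bit, not part of any posted series yet):

    /* One-shot variant: isolation lapses at the next kernel entry
     * (syscall, page fault, etc.) instead of persisting. */
    prctl(PR_SET_TASK_ISOLATION,
          PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ONESHOT, 0, 0, 0);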


   TL;DR: We should be more willing to return -EINVAL from prctl().
   =====

One thing you've argued is that we should be more aggressive about
failing the prctl() call.  I think, in any case, that this is probably
reasonable.  We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series).  This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.

However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_usermode
routine are still required even with your proposed one-shot mode.  We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.
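
Concretely, the prctl-time validation could look something like this
sketch (the names approximate this series rather than quoting it, and
can_stop_full_tick() stands in for whatever its moral equivalent ends up
being):

    int task_isolation_request(void)    /* hypothetical */
    {
        int cpu = smp_processor_id();

        /* Must be affined to exactly one task-isolation cpu. */
        if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
            !task_isolation_possible(cpu))
            return -EINVAL;

        /* Fail if the tick can't stop now, e.g. because another
         * task is schedulable on this cpu. */
        if (!can_stop_full_tick())
            return -EINVAL;

        return 0;
    }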


   TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
   =====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.
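
(The "hook in" alternative would be something like the following, in the
exit-to-usermode path; syscall_set_return_value() is the real interface
here, the other names are hypothetical:)

    if (task_isolation_oneshot(current) &&      /* hypothetical */
        !_task_isolation_ready()) {
        /* Turn the in-flight prctl into an -EBUSY return and
         * stop looping. */
        syscall_set_return_value(current, regs, -EBUSY, 0);
        task_isolation_clear(current);          /* hypothetical */
    }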


   TL;DR: Should we arrange to actually use a completion in
   task_isolation_enter when dynticks are ticking, and call complete()
   in tick-sched.c when we shut down dynticks, or, just spin in
   schedule() and not worry about burning a little cpu?
   =====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...
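
A minimal sketch of that completion-based arrangement, assuming a
per-cpu completion (initialized at boot via init_completion()) and a
hypothetical hook called from tick-sched.c once the tick actually stops:

    static DEFINE_PER_CPU(struct completion, tick_stopped_done);

    /* tick-sched.c, when dynticks finally engages on this cpu: */
    void tick_nohz_notify_stopped(void)         /* hypothetical */
    {
        complete_all(this_cpu_ptr(&tick_stopped_done));
    }

    /* task_isolation_enter(), instead of spinning via schedule(): */
    void task_isolation_wait_for_tick(void)     /* hypothetical */
    {
        wait_for_completion(this_cpu_ptr(&tick_stopped_done));
    }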


   TL;DR: We should turn off task isolation mode for signals.
   =====

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.
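
(Concretely, it's roughly a one-line hook in the signal setup path; both
helper names in this sketch are hypothetical:)

    /* In the arch handle_signal() path: */
    if (task_isolation_enabled(current))        /* hypothetical */
        task_isolation_clear(current);          /* hypothetical */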

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.


   TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
   =====

This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.


   TL;DR: Various minor issues in answer to Frederic's comments :-)
   =====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice.  I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed.  We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> scheduled on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for.  But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode. Now,
> assuming we want a pending task to resume so that we get the CPU to ourselves,
> we have no idea if the scheduler is going to schedule that task; it depends on
> vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
> doesn't guarantee any context switch.

Yes, true.  So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering full dynticks mode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.

Fair enough, and certainly means at a minimum we need a good comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond; userspace
> needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
> what can be useful, as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again.  Then we can context switch back to
our task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()?  For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets.  Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task.  These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode.  By default it uses SIGTERM.  You can
also choose a different signal via the prctl() API.

Thanks again, Frederic!  I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this reply.  I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-03-09 19:39                   ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-03-09 19:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

Frederic,

Thanks for the detailed feedback on the task isolation stuff.

This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.


   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.

There are a number of pros and cons to the two models.  I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:

PRO for persistent mode: A somewhat easier programming model.  Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously.  For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode.  This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc.  With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running).  By contrast, in
one-shot mode, any return to kernel spaces turns off task isolation
anyway, so it's very clear what the interaction looks like.  I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc.  Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).


   TL;DR: We should be more willing to return -EINVAL from prctl().
   =====

One thing you've argued is that we should be more aggressive about
failing the prctl() call.  I think, in any case, that this is probably
reasonable.  We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series).  This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.

However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_userspace
routine are still required even with your proposed one-shot mode.  We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.


   TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
   =====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.


   TL;DR: Should we arrange to actually use a completion in
   task_isolation_enter when dynticks are ticking, and call complete()
   in tick-sched.c when we shut down dynticks, or, just spin in
   schedule() and not worry about burning a little cpu?
   =====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...


   TL;DR: We should turn off task isolation mode for signals.
   =====

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.
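
The hook itself ought to be small; something like this in the arch
signal-delivery path (a sketch; both task_isolation helper names here
are invented for illustration):

    /* Sketch: drop out of task isolation whenever a signal is
     * actually delivered, so the handler runs right away instead
     * of waiting for the kernel to quiesce. */
    static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
    {
            if (task_isolation_enabled(current))
                    task_isolation_clear(current);
            /* ... usual signal frame setup follows ... */
    }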


   TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
   =====

This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.


   TL;DR: Various minor issues in answer to Frederic's comments :-)
   =====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice.  I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed.  We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> schedule on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for.  But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this breaks task isolation mode. Now,
> assuming we want a pending task to resume such that we get the CPU for ourselves,
> we have no idea if the scheduler is going to schedule that task; it depends on
> vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it
> doesn't guarantee any context switch.

Yes, true.  So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering full dynticks mode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.

Fair enough, and certainly means at a minimum we need a good comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond; userspace
> needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
> what can be useful as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again.  Then we can context switch back to
our task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()?  For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets.  Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task.  These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode.  By default it uses SIGTERM.  You can
also choose a different signal via the prctl() API.

Thanks again, Frederic!  I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this patch.  I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-03-09 19:39                   ` Chris Metcalf
  (?)
@ 2016-04-08 13:56                   ` Frederic Weisbecker
  2016-04-08 16:34                       ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-04-08 13:56 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> Frederic,
> 
> Thanks for the detailed feedback on the task isolation stuff.
> 
> This reply kind of turned into an essay, so I've added a little "TL;DR"
> sentence before each section.

I think I'm going to cut my reply into several threads, because really
I can't get myself to make a giant reply all at once :-)

> 
> 
>   TL;DR: Let's make an explicit decision about whether task isolation
>   should be "persistent" or "one-shot".  Both have some advantages.
>   =====
> 
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
> 
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely.  It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers?

> 
> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued.  This is what you recommended in your last
> email.

No, I think we can issue syscalls, for example.  But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.

> 
> There are a number of pros and cons to the two models.  I think on
> balance I still like the "persistent mode" approach, but here's all
> the pros/cons I can think of:
> 
> PRO for persistent mode: A somewhat easier programming model.  Users
> can just imagine "task isolation" as a way for them to still be able
> to use the kernel exactly as they always have; it's just slower to get
> back out of the kernel so you use it judiciously. For example, a
> process is free to call write() on a socket to perform a diagnostic,
> but when returning from the write() syscall, the kernel will hold the
> task in kernel mode until any timer ticks (perhaps from networking
> stuff) are complete, and then let it return to userspace to continue
> in task isolation mode.

So this is not hard isolation anymore. This is rather soft isolation with
best efforts to avoid disturbance.

Surely we can have different levels of isolation.

I'm still wondering what to do if the task migrates to another CPU. In fact,
perhaps what you're trying to do is a CPU property rather than a process property?

> This is convenient to the user since they
> don't have to fret about re-enabling task isolation after that
> syscall, page fault, or whatever; they can just continue running.
> With your suggestion, the user pretty much has to leave STRICT mode
> enabled so he gets notified of any unexpected return to kernel space
> (in fact we might make it required so you always get a signal when
> leaving task isolation unless it's via a prctl or exit syscall).

Right. Although we can allow all syscalls in this mode actually.

> 
> PRO for one-shot mode: A somewhat crisper interaction with
> sched_setaffinity() etc.  With a persistent mode approach, a task can
> start up task isolation, then later another task can be placed on its
> cpu and break it (it won't return to userspace until killed or the new
> process affinitizes itself away or stops running).  By contrast, in
> one-shot mode, any return to kernel spaces turns off task isolation
> anyway, so it's very clear what the interaction looks like.  I suspect
> this is more a theoretical advantage to one-shot mode than a practical
> one, though.

I think I heard about workloads that need such strict hard isolation.
Workloads that really cannot afford any disturbance. They even
use a userspace network stack. Maybe HFT?

> CON for one-shot mode: It's actually hard to catch every kernel entry
> so we can turn the task-isolation flag off again - and we really do
> need to have a flag, just so that we can suitably debug any bad
> actions that bring us into the kernel when we're not expecting it.
> Right now there are things that bring us into the kernel that we don't
> bother annotating for task isolation STRICT mode, just because they're
> visible to the user anyway: e.g., a bus fault or segmentation
> violation.
> 
> I think we can actually make both modes available to users with just
> another flag bit, so maybe we can look at what that looks like in v11:
> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> isolation at the next syscall entry, page fault, etc.  Then we can
> think more specifically about whether we want to remove the flag or
> not, and if we remove it, whether we want to make the code that was
> controlled by it unconditionally true or unconditionally false
> (i.e. remove it again).

I think we shouldn't bother with strict hard isolation if we don't need
it yet. The implementation may well be invasive. Let's wait for someone
who really needs it.

> 
> 
>   TL;DR: We should be more willing to return -EINVAL from prctl().
>   =====
> 
> One thing you've argued is that we should be more aggressive about
> failing the prctl() call.  I think, in any case, that this is probably
> reasonable.  We already check that the task's affinity is limited to
> the current core and that that core is a task_isolation cpu; I think we
> can also require that can_stop_full_tick() return true (or the moral
> equivalent given your recent patch series).  This will mean you can't
> even try to go into task isolation mode if another task is
> schedulable, among other things, which seems like a good thing.
> 
> However, it is important to note that the current task_isolation_ready
> and task_isolation_enter calls that are in the prepare_exit_to_usermode
> routine are still required even with your proposed one-shot mode.  We
> have to be sure that no interrupts occur on the way back to userspace
> that might then in principle lead to timer interrupts being scheduled,
> and the way to do that is to make sure task_isolation_ready returns true
> with interrupts disabled, and that interrupts are not then re-enabled
> before the return to userspace.  Anything else is just keeping your
> fingers crossed and guessing.

So your requirements are actually hard isolation but in userspace?

And what happens if you get interrupted in userspace? What about page
faults and other exceptions?

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 13:56                   ` Frederic Weisbecker
@ 2016-04-08 16:34                       ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-08 16:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >   TL;DR: Let's make an explicit decision about whether task isolation
> >   should be "persistent" or "one-shot".  Both have some advantages.
> >   =====
> >
> > An important high-level issue is how "sticky" task isolation mode is.
> > We need to choose one of these two options:
> >
> > "Persistent mode": A task switches state to "task isolation" mode
> > (kind of a level-triggered analogy) and stays there indefinitely.  It
> > can make a syscall, take a page fault, etc., if it wants to, but the
> > kernel protects it from incurring any further asynchronous interrupts.
> > This is the model I've been advocating for.
>
> But then in this mode, what happens when an interrupt triggers?

So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt.  By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.

So for interrupts, the short answer is, it's a bug! :-)

An interrupt could be a kernel bug, in which case we consider it a
"true" bug.  This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
bug.  I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be.  For example, can another core unload a
kernel module without interrupting a task-isolation task?  Not right now.

Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core.  This
causes TLB flush interrupts under application control.  The
application shouldn't do this, and we tell our customers not to build
their applications this way.  The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores.  The
two types of threads just both mmap some common, shared memory but run
as different processes.
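
A minimal sketch of that pattern (illustrative; the path, the
RING_BYTES size, and the ring protocol itself are all up to the
application):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <err.h>

    /* Both the control process and the isolated process map the
     * same shared file; all cross-core communication then goes
     * through this memory rather than through syscalls or
     * TLB-flushing unmaps. */
    int fd = open("/dev/shm/dataplane-ring", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, RING_BYTES) < 0)    /* RING_BYTES: app-defined */
            err(1, "shared ring setup");
    void *ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
            err(1, "mmap");
    /* e.g. run a single-producer/single-consumer ring over "ring" */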

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it.  This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.

If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)

> > "One-shot mode": A task requests isolation via prctl(), the kernel
> > ensures it is isolated on return from the prctl(), but then as soon as
> > it enters the kernel again, task isolation is switched off until
> > another prctl is issued.  This is what you recommended in your last
> > email.
>
> No, I think we can issue syscalls, for example.  But asynchronous interruptions
> such as exceptions (actually somewhat synchronous but can be unexpected) and
> interrupts are what we want to avoid.

Hmm, so I think I'm not really understanding what you are suggesting.

We're certainly in agreement that avoiding interrupts and exceptions
is important.  I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc.  I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.

(Typically the recommendation is to do an mlockall() before starting
task isolation mode, to handle the case of page faults.  But you can
do that and still be screwed by another thread in your process doing a
fork() and then your pages end up read-only for COW and you have to
fault them back in.  But, that's an application bug for a
task-isolation thread, and should just be treated as such.)
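
Putting those recommendations together, a typical setup sequence
looks something like this (a sketch; "isolated_cpu" is whatever
task_isolation core you have chosen, and the PR_* constants come
from the patched <linux/prctl.h>):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <err.h>

    /* Pin to the isolated core, lock memory down to avoid later
     * page faults, then ask for task isolation. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(isolated_cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
            err(1, "sched_setaffinity");
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
            err(1, "mlockall");
    if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE))
            err(1, "prctl(PR_SET_TASK_ISOLATION)");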

> > There are a number of pros and cons to the two models.  I think on
> > balance I still like the "persistent mode" approach, but here's all
> > the pros/cons I can think of:
> >
> > PRO for persistent mode: A somewhat easier programming model.  Users
> > can just imagine "task isolation" as a way for them to still be able
> > to use the kernel exactly as they always have; it's just slower to get
> > back out of the kernel so you use it judiciously. For example, a
> > process is free to call write() on a socket to perform a diagnostic,
> > but when returning from the write() syscall, the kernel will hold the
> > task in kernel mode until any timer ticks (perhaps from networking
> > stuff) are complete, and then let it return to userspace to continue
> > in task isolation mode.
>
> So this is not hard isolation anymore. This is rather soft isolation with
> best efforts to avoid disturbance.

No, it's still hard isolation.  The distinction is that we offer a way
to get in and out of the kernel "safely" if you want to run in that
mode.  The syscalls can take a long time if the syscall ends up
requiring some additional timer ticks to finish sorting out whatever
it was you asked the kernel to do, but once you're back in userspace
you immediately regain "hard" isolation.  It's under program control.

Or, you can enable "strict" mode, and then you get hard isolation
without the ability to get in and out of the kernel at all: the kernel
just kills you if you try to leave hard isolation other than by an
explicit prctl().

> Surely we can have different levels of isolation.

Well, we have nohz_full now, and by adding task-isolation, we have
two.  Or three if you count "base" and "strict" mode task isolation as
two separate levels.

> I'm still wondering what to do if the task migrates to another CPU. In fact,
> perhaps what you're trying to do is a CPU property rather than a
> process property?

Well, we did go around on this issue once already (last August) and at
the time you were encouraging isolation to be a "task" property, not a
"cpu" property:

https://lkml.kernel.org/r/20150812160020.GG21542@lerouge

You convinced me at the time :-)

You're right that migration conflicts with task isolation.  But
certainly, if a task has enabled "strict" semantics, it can't migrate;
it will lose task isolation entirely and get a signal instead,
regardless of whether it calls sched_setaffinity() on itself, or if
someone else changes its affinity and it gets a kick.

However, if a task doesn't have strict mode enabled, it can call
sched_setaffinity() and force itself onto a non-task_isolation cpu and
it won't get any isolation until it schedules itself back onto a
task_isolation cpu, at which point it wakes up on the new cpu with
hard isolation still in effect.  I can make up reasons why this sort
of thing might be useful, but it's probably a corner case.

However, this makes me wonder if "strict" mode should be the default
for task isolation??  That way task isolation really doesn't conflict
semantically with migration.  And we could provide a "weak" mode, or a
"kernel-friendly" mode, or some such nomenclature, and define the
migration semantics just for that case, where it makes it clear it's a
bit unusual.

> I think I heard about workloads that need such strict hard isolation.
> Workloads that really cannot afford any disturbance. They even
> use a userspace network stack. Maybe HFT?

Certainly HFT is one case.

A lot of TILE-Gx customers using task isolation (which we call
"dataplane" or "Zero-Overhead Linux") are doing high-speed network
applications with user-space networking stacks.  It can be DPDK, or it
can be another TCP/IP stack (we ship one called tStack) or it
could just be an application directly messing with the network
hardware from userspace.  These are exactly the applications that led
me into this part of kernel development in the first place.
Googling "Zero-Overhead Linux" does take you to some discussions
of customers that have used this functionality.

> > I think we can actually make both modes available to users with just
> > another flag bit, so maybe we can look at what that looks like in v11:
> > adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> > isolation at the next syscall entry, page fault, etc.  Then we can
> > think more specifically about whether we want to remove the flag or
> > not, and if we remove it, whether we want to make the code that was
> > controlled by it unconditionally true or unconditionally false
> > (i.e. remove it again).
>
> I think we shouldn't bother with strict hard isolation if we don't need
> it yet. The implementation may well be invasive. Let's wait for someone
> who really needs it.

I'm not sure what part of the patch series you're saying you don't
think we need yet.  I'd argue the whole patch series is "hard
isolation", and that the "strict" mode introduced in patch 06/13 isn't
particularly invasive.

> So your requirements are actually hard isolation but in userspace?

Yes, exactly.  Were you thinking about a kernel-level hard isolation?
That would have some similarities, I guess, but in some ways might
actually be a harder problem.

> And what happens if you get interrupted in userspace? What about page
> faults and other exceptions?

See above :-)

I hope we're converging here.  If you want to talk live or chat online
to help finish converging, perhaps that would make sense?  I'd be
happy to take notes and publish a summary of wherever we get to.

Thanks for taking the time to review this!

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-04-12 18:41                         ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-12 18:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/8/2016 12:34 PM, Chris Metcalf wrote:
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual. 

I noodled around with this and decided it was a better default,
so I've made the changes and pushed it up to the branch:

     git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Now, by default when you enter task isolation mode, you are in
what I used to call "strict" mode, i.e. you can't use the kernel.

You can select a user-specified signal you want to deliver instead of
the default SIGKILL, and if you select signal 0, then you don't get
a signal at all and instead you get to keep running in task
isolation mode after making a syscall, page fault, etc.

Thus the API now looks like this in <linux/prctl.h>:

#define PR_SET_TASK_ISOLATION		48
#define PR_GET_TASK_ISOLATION		49
# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
# define PR_TASK_ISOLATION_USERSIG	(1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
	(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))

I think this better matches what people should want to do in
their applications, and also matches the expectations people
have about what it means to go into task isolation mode in the
first place.
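
In use, that comes out to something like this (a sketch; SIGUSR1 is
just an example user signal, from the usual <signal.h>):

    /* Default: hard isolation, SIGKILL on any kernel entry. */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);

    /* Deliver SIGUSR1 instead of SIGKILL on a violation: */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
          PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

    /* Signal 0: no signal at all; keep isolating across syscalls,
     * page faults, etc.: */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
          PR_TASK_ISOLATION_NOSIG);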

I got rid of the ONESHOT mode that I added in the v12 series, since
it didn't seem like it was what Frederic had been asking for anyway,
and it didn't seem particularly useful on its own.

Frederic, how does this align with your intuition for this stuff?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
  (?)
  (?)
@ 2016-04-22 13:16                       ` Frederic Weisbecker
  2016-04-25 20:36                           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-04-22 13:16 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers?
> 
> So here I'm taking "interrupt" to mean an external, asynchronous
> interrupt, from another core or device, or asynchronously triggered
> on the local core, like a timer interrupt.  By contrast I use "exception"
> or "fault" to refer to synchronous, locally-triggered interruptions.

Ok.

> So for interrupts, the short answer is, it's a bug! :-)
> 
> An interrupt could be a kernel bug, in which case we consider it a
> "true" bug.  This could be a timer interrupt occurring even after the
> task isolation code thought there were none pending, or a hardware
> device that incorrectly distributes interrupts to a task-isolation
> cpu, or a global IPI that should be sent to fewer cores, or a kernel
> TLB flush that could be deferred until the task-isolation task
> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
> bug.  I'm sure there are more such bugs that we can continue to fix
> going forward; it depends on how arbitrary you want to allow code
> running on other cores to be.  For example, can another core unload a
> kernel module without interrupting a task-isolation task?  Not right now.
> 
> Or, it could be an application bug: the standard example is if you
> have an application with task-isolated cores that also does occasional
> unmaps on another thread in the same process, on another core.  This
> causes TLB flush interrupts under application control.  The
> application shouldn't do this, and we tell our customers not to build
> their applications this way.  The typical way we encourage our
> customers to arrange this kind of "multi-threading" is by having a
> pure memory API between the task isolation threads and what are
> typically "control" threads running on non-task-isolated cores.  The
> two types of threads just both mmap some common, shared memory but run
> as different processes.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

So if we take an interrupt that we didn't expect, we want to wait some more
at the end of that interrupt for things to quiesce some more?

That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce more
in case of interruptions; the tick can't be forced off anyway.

> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

That sounds sensible.

> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Ok.

> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example.  But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.  I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.

Ok, that looks right.

> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)

Ok.

> 
> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.

Yeah indeed, tasks should be allowed to perform syscalls. So we can assume
that interrupts are fine when they fire in kernel mode.

> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().

That would be extreme strict mode, yeah. We can still add such a mode later
if any user requests it.

Thanks.

(I'll reply to the rest of the email soonish.)

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-22 13:16                       ` Frederic Weisbecker
@ 2016-04-25 20:36                           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-04-25 20:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 4/22/2016 9:16 AM, Frederic Weisbecker wrote:
> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers?
>> So here I'm taking "interrupt" to mean an external, asynchronous
>> interrupt, from another core or device, or asynchronously triggered
>> on the local core, like a timer interrupt.  By contrast I use "exception"
>> or "fault" to refer to synchronous, locally-triggered interruptions.
> Ok.
>
>> So for interrupts, the short answer is, it's a bug! :-)
>>
>> An interrupt could be a kernel bug, in which case we consider it a
>> "true" bug.  This could be a timer interrupt occurring even after the
>> task isolation code thought there were none pending, or a hardware
>> device that incorrectly distributes interrupts to a task-isolation
>> cpu, or a global IPI that should be sent to fewer cores, or a kernel
>> TLB flush that could be deferred until the task-isolation task
>> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
>> bug.  I'm sure there are more such bugs that we can continue to fix
>> going forward; it depends on how arbitrary you want to allow code
>> running on other cores to be.  For example, can another core unload a
>> kernel module without interrupting a task-isolation task?  Not right now.
>>
>> Or, it could be an application bug: the standard example is if you
>> have an application with task-isolated cores that also does occasional
>> unmaps on another thread in the same process, on another core.  This
>> causes TLB flush interrupts under application control.  The
>> application shouldn't do this, and we tell our customers not to build
>> their applications this way.  The typical way we encourage our
>> customers to arrange this kind of "multi-threading" is by having a
>> pure memory API between the task isolation threads and what are
>> typically "control" threads running on non-task-isolated cores.  The
>> two types of threads just both mmap some common, shared memory but run
>> as different processes.
>>
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> So if we take an interrupt that we didn't expect, we want to wait some more
> at the end of that interrupt for things to quiesce some more?

I think it's actually pretty plausible.

Consider the "application bug" case, where you're running some code that does
packet dispatch to different cores.  If a core seems to back up, you stop
dispatching packets to it.

Now, we get a TLB flush.  If handling the flush causes us to restart the tick
(maybe just as a side effect of entering the kernel in the first place), we
really are better off staying in the kernel until the tick is handled and
things are quiesced again.  That way, although we may end up dropping a
bunch of packets that were queued up to that core, we only do so ONCE - we
don't do it again when the tick fires a little bit later on, when the core
has already caught up and is claiming to be able to handle packets again.

Also, pragmatically, we would require a whole bunch of machinery in the
kernel to figure out whether we were returning from a syscall, an exception,
or an interrupt, and only skip the task-isolation work for interrupts.  We
don't actually have that information available to us at the moment we are
returning to userspace right now, so we'd need to add that tracking state
in each platform's code somehow.


> That doesn't look right. Things should be quiesced once and for all on
> return from the initial prctl() call. We can't even expect to quiesce more
> in case of interruptions; the tick can't be forced off anyway.

Yes, things are quiesced once and for all after prctl().  We also need to
be prepared to handle unexpected interrupts, though.  It's true that we can't
force the tick off, but as I suggested above, just waiting for the tick may
well be a better strategy than subjecting the application to another interrupt
after some fraction of a second.

>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> That would be extreme strict mode, yeah. We can still add such a mode later
> if any user requests it.

So, humorously, I have become totally convinced that "extreme strict mode"
is really the right default for isolation.  It gives semantics that are easily
understandable: you stay in userspace until you do a prctl() to turn off
the flag, or exit(), or else the kernel kills you.  And, it's probably what
people want by default anyway for userspace driver code.  For code that
legitimately wants to make syscalls in this mode, you can just prctl() the
mode off, do whatever you need to do, then prctl() the mode back on again.
It's nominally a bit of overhead, but as a task-isolated application you
should be expecting tons of overhead from going into the kernel anyway.
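
I.e., the pattern is just this (a sketch; log_fd/buf/len are
illustrative, and I'm assuming a zero flags argument clears the mode):

    prctl(PR_SET_TASK_ISOLATION, 0);           /* leave isolation */
    write(log_fd, buf, len);                   /* any syscalls you like */
    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);  /* back on */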

The "less extreme strict mode" is arguably reasonable if you want to allow
people to make occasional syscalls, but it has confusing performance
characteristics (sometimes the syscalls happen quickly, but sometimes they
take multiple ticks while we wait for interrupts to quiesce), and it has
confusing semantics (what happens if a third party re-affinitizes you to
a non-isolated core).  So I like the idea of just having a separate flag
(PR_TASK_ISOLATION_NOSIG) that tells the kernel to let the user play in
the kernel without getting killed.

> (I'll reply to the rest of the email soonish.)

Thanks for the feedback.  It makes me feel like we may get there eventually :-)

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-04-08 16:34                       ` Chris Metcalf
                                         ` (2 preceding siblings ...)
  (?)
@ 2016-05-26  1:07                       ` Frederic Weisbecker
  2016-06-03 19:32                           ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-05-26  1:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

I don't remember how much of this email I answered, but I need to finish that :-)

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

Good, although that quiescing on kernel return must be an option.

> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

Good.

> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Right, I suggest we use trace events btw.

> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No, I think we can issue syscalls, for example. But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
> 
> Hmm, so I think I'm not really understanding what you are suggesting.
> 
> We're certainly in agreement that avoiding interrupts and exceptions
> is important.  I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc.

Yes.

> I'm not actually sure what
> you're recommending we do to avoid exceptions.  Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them.  For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region.  I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.

They are not all deterministic. For example, a breakpoint, a step, or a trap
can be set up by another process. So this is not entirely under the control
of the user.

> 
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults.  But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in.  But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)

Now how do you determine which exception is a bug and which is expected?
Strict mode should refuse all of them.

> >> There are a number of pros and cons to the two models.  I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model.  Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
> 
> No, it's still hard isolation.  The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode.  The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation.  It's under program control.
> 
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().

Well, hard isolation is what I would call strict mode.

> 
> >Surely we can have different levels of isolation.
> 
> Well, we have nohz_full now, and by adding task-isolation, we have
> two.  Or three if you count "base" and "strict" mode task isolation as
> two separate levels.

Right.

> 
> >I'm still wondering what to do if the task migrates to another CPU. In fact,
> >perhaps what you're trying to do is rather a CPU property than a
> >process property?
> 
> Well, we did go around on this issue once already (last August) and at
> the time you were encouraging isolation to be a "task" property, not a
> "cpu" property:
> 
> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> 
> You convinced me at the time :-)

Indeed :-) Well if it's a task property, we need to handle its affinity properly then.

> 
> You're right that migration conflicts with task isolation.  But
> certainly, if a task has enabled "strict" semantics, it can't migrate;
> it will lose task isolation entirely and get a signal instead,
> regardless of whether it calls sched_setaffinity() on itself, or if
> someone else changes its affinity and it gets a kick.

Yes.

> 
> However, if a task doesn't have strict mode enabled, it can call
> sched_setaffinity() and force itself onto a non-task_isolation cpu and
> it won't get any isolation until it schedules itself back onto a
> task_isolation cpu, at which point it wakes up on the new cpu with
> hard isolation still in effect.  I can make up reasons why this sort
> of thing might be useful, but it's probably a corner case.

That doesn't look sane. The user asks the kernel to get out of the way as much
as it can, but if we are on a non-nohz-full CPU we know we can't provide that
service (or rather that non-service).

So we would refuse to enter task isolation mode if the task doesn't run on a
full dynticks CPU, whereas we accept that it migrates later to a periodic
CPU? This isn't consistent.



> 
> However, this makes me wonder if "strict" mode should be the default
> for task isolation??  That way task isolation really doesn't conflict
> semantically with migration.  And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual.

Well we can't really implement that strict mode until we fix the 1Hz issue, right?
Besides, is this something that anyone needs now?

> 
> >I think I heard about workloads that need such strict hard isolation.
> >Workloads that really can not afford any disturbance. They even
> >use userspace network stack. Maybe HFT?
> 
> Certainly HFT is one case.
> 
> A lot of TILE-Gx customers using task isolation (which we call
> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
> applications with user-space networking stacks.  It can be DPDK, or it
> can be another TCP/IP stack (we ship one called tStack) or it
> could just be an application directly messing with the network
> hardware from userspace.  These are exactly the applications that led
> me into this part of kernel development in the first place.
> Googling "Zero-Overhead Linux" does take you to some discussions
> of customers that have used this functionality.

So those workloads couldn't stand an interrupt? Like they would like a signal
and exit the strict mode if it happens?

I think that we need to wait for somebody who explicitly requests that feature
before we work on it, so we can be sure the semantics really agree with someone's
real load case.

> 
> >> I think we can actually make both modes available to users with just
> >> another flag bit, so maybe we can look at what that looks like in v11:
> >> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> >> isolation at the next syscall entry, page fault, etc.  Then we can
> >> think more specifically about whether we want to remove the flag or
> >> not, and if we remove it, whether we want to make the code that was
> >> controlled by it unconditionally true or unconditionally false
> >> (i.e. remove it again).
> >
> >I think we shouldn't bother with strict hard isolation if we don't need
> >it yet. The implementation may well be invasive. Let's wait for someone
> >who really needs it.
> 
> I'm not sure what part of the patch series you're saying you don't
> think we need yet.  I'd argue the whole patch series is "hard
> isolation", and that the "strict" mode introduced in patch 06/13 isn't
> particularly invasive.

It's not in the patch series, I'm talking about the strict mode :-)

> 
> >So your requirements are actually hard isolation but in userspace?
> 
> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
> That would have some similarities, I guess, but in some ways might
> actually be a harder problem.
> 
> >And what happens if you get interrupted in userspace? What about page
> >faults and other exceptions?
> 
> See above :-)
> 
> I hope we're converging here.  If you want to talk live or chat online
> to help finish converging, perhaps that would make sense?  I'd be
> happy to take notes and publish a summary of wherever we get to.
> 
> Thanks for taking the time to review this!

Ok, so thinking about that talk, I'm wondering if we need some flags
such as:

         ISOLATION_SIGNAL_SYSCALL
         ISOLATION_SIGNAL_EXCEPTIONS
         ISOLATION_SIGNAL_INTERRUPTS

Strict mode would be the three above OR'ed. These are just some random thoughts,
but they would help define which level of kernel intrusion the user is ready
to tolerate.

I'm just not sure how granular we want that interface to be.
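
Expressed as C flag bits, just to make the idea concrete (the names and
values here are purely illustrative, nothing from the actual series):

#define ISOLATION_SIGNAL_SYSCALL	0x1
#define ISOLATION_SIGNAL_EXCEPTIONS	0x2
#define ISOLATION_SIGNAL_INTERRUPTS	0x4

/* Strict mode would simply be all three OR'ed together. */
#define ISOLATION_STRICT	(ISOLATION_SIGNAL_SYSCALL | \
				 ISOLATION_SIGNAL_EXCEPTIONS | \
				 ISOLATION_SIGNAL_INTERRUPTS)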

> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-05-26  1:07                       ` Frederic Weisbecker
@ 2016-06-03 19:32                           ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-06-03 19:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> I don't remember how much of this email I answered, but I need to finish that :-)

Sorry for the slow response - it's been a busy week.

> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>    TL;DR: Let's make an explicit decision about whether task isolation
>>>>    should be "persistent" or "one-shot".  Both have some advantages.
>>>>    =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers.
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process.  This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> Good, although that quiescing on kernel return must be an option.

Can you spell out why you think turning it off is helpful?  I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful.  Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces.  If you're asking for
task isolation, this is surely not what you want.

>> If you enable "strict" mode, we disable task isolation mode for that
>> core and deliver a signal to it.  This lets the application know that
>> an interrupt occurred, and it can take whatever kind of logging or
>> debugging action it wants to, re-enable task isolation if it wants to
>> and continue, or just exit or abort, etc.
> Good.
>
>> If you don't enable "strict" mode, but you do have
>> task_isolation_debug enabled as a boot flag, you will at least get a
>> console dump with a backtrace and whatever other data we have.
>> (Sometimes the debug info actually includes a backtrace of the
>> interrupting core, if it's an IPI or TLB flush from another core,
>> which can be pretty useful.)
> Right, I suggest we use trace events btw.

This is probably a good idea, although I wonder if it's worth deferring
until after the main patch series goes in - I'm reluctant to expand the scope
of this patch series and add more reasons for it to get delayed :-)
What do you think?

>>>> "One-shot mode": A task requests isolation via prctl(), the kernel
>>>> ensures it is isolated on return from the prctl(), but then as soon as
>>>> it enters the kernel again, task isolation is switched off until
>>>> another prctl is issued.  This is what you recommended in your last
>>>> email.
>>> No, I think we can issue syscalls, for example. But asynchronous interruptions
>>> such as exceptions (actually somewhat synchronous but can be unexpected) and
>>> interrupts are what we want to avoid.
>> Hmm, so I think I'm not really understanding what you are suggesting.
>>
>> We're certainly in agreement that avoiding interrupts and exceptions
>> is important.  I'm arguing that the way to deal with them is to
>> generate appropriate signals/printks, etc.
> Yes.
>
>> I'm not actually sure what
>> you're recommending we do to avoid exceptions.  Since they're
>> synchronous and deterministic, we can't really avoid them if the
>> program wants to issue them.  For example, mmap() some anonymous
>> memory and then start running, and you'll take exceptions each time
>> you touch a page in that mapped region.  I'd argue it's an application
>> bug; one should enable "strict" mode to catch and deal with such bugs.
> They are not all deterministic. For example, a breakpoint, a step, or a trap
> can be set up by another process. So this is not entirely under the control
> of the user.

That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.  There are two ways you could handle debugging:

1. Require the program to set the flag that says it doesn't want a signal
when it is interrupted (so you can interrupt it to debug it, and not kill it);

2. Or have debugging automatically set that flag in the target process.
Similarly, we could just say that if a debugger is attached, we never
generate the kill signal for task isolation.

>> (Typically the recommendation is to do an mlockall() before starting
>> task isolation mode, to handle the case of page faults.  But you can
>> do that and still be screwed by another thread in your process doing a
>> fork() and then your pages end up read-only for COW and you have to
>> fault them back in.  But, that's an application bug for a
>> task-isolation thread, and should just be treated as such.)
> Now how do you determine which exception is a bug and which is expected?
> Strict mode should refuse all of them.

Yes, exactly.  Task isolation will complain about everything. :-)

>>>> There are a number of pros and cons to the two models.  I think on
>>>> balance I still like the "persistent mode" approach, but here's all
>>>> the pros/cons I can think of:
>>>>
>>>> PRO for persistent mode: A somewhat easier programming model.  Users
>>>> can just imagine "task isolation" as a way for them to still be able
>>>> to use the kernel exactly as they always have; it's just slower to get
>>>> back out of the kernel so you use it judiciously. For example, a
>>>> process is free to call write() on a socket to perform a diagnostic,
>>>> but when returning from the write() syscall, the kernel will hold the
>>>> task in kernel mode until any timer ticks (perhaps from networking
>>>> stuff) are complete, and then let it return to userspace to continue
>>>> in task isolation mode.
>>> So this is not hard isolation anymore. This is rather soft isolation with
>>> best efforts to avoid disturbance.
>> No, it's still hard isolation.  The distinction is that we offer a way
>> to get in and out of the kernel "safely" if you want to run in that
>> mode.  The syscalls can take a long time if the syscall ends up
>> requiring some additional timer ticks to finish sorting out whatever
>> it was you asked the kernel to do, but once you're back in userspace
>> you immediately regain "hard" isolation.  It's under program control.
>>
>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> Well, hard isolation is what I would call strict mode.

Here's what I am inclined towards:

  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.

  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
    return, no signals.  But asynchronous interrupts still cause a signal since they are
    not expected to occur.

  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
    on return to userspace, and asynchronous interrupts don't even cause a signal.
    It's basically "best effort", just nohz_full plus the code that tries to get things
    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
    "value add" to make this a separate mode, though.

>>> Surely we can have different levels of isolation.
>> Well, we have nohz_full now, and by adding task-isolation, we have
>> two.  Or three if you count "base" and "strict" mode task isolation as
>> two separate levels.
> Right.
>
>>> I'm still wondering what to do if the task migrates to another CPU. In fact,
>>> perhaps what you're trying to do is rather a CPU property than a
>>> process property?
>> Well, we did go around on this issue once already (last August) and at
>> the time you were encouraging isolation to be a "task" property, not a
>> "cpu" property:
>>
>> https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
>>
>> You convinced me at the time :-)
> Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
>> You're right that migration conflicts with task isolation.  But
>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>> it will lose task isolation entirely and get a signal instead,
>> regardless of whether it calls sched_setaffinity() on itself, or if
>> someone else changes its affinity and it gets a kick.
> Yes.
>
>> However, if a task doesn't have strict mode enabled, it can call
>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>> it won't get any isolation until it schedules itself back onto a
>> task_isolation cpu, at which point it wakes up on the new cpu with
>> hard isolation still in effect.  I can make up reasons why this sort
>> of thing might be useful, but it's probably a corner case.
> That doesn't look sane. The user asks the kernel to get out of the way as much
> as it can, but if we are on a non-nohz-full CPU we know we can't provide that
> service (or rather that non-service).
>
> So we would refuse to enter task isolation mode if the task doesn't run on a
> full dynticks CPU, whereas we accept that it migrates later to a periodic
> CPU? This isn't consistent.

Yes, and originally I made that consistent by not checking when it started
up, either, but I was subsequently convinced that the checks were good for
sanity.

Another answer is just to say that the full strict mode is the only mode, and
that if the task leaves userspace, it leaves task isolation mode until the mode
is re-enabled.  In the context of receiving a signal each time, this is more plausible.
You can always re-enable task isolation in the signal handler if you want.
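
For instance, the handler might look roughly like this (a sketch only:
it assumes the isolation-violation signal has been made catchable, say
SIGUSR1 instead of a default fatal signal, which is a detail the series
would still have to provide):

#include <signal.h>
#include <sys/prctl.h>

/* Handler for a (hypothetically catchable) isolation-violation signal. */
static void isolation_lost(int sig)
{
	/* ... log or count the interruption here ... */

	/* Opt back in.  prctl() is a thin syscall wrapper, so calling it
	 * from a handler should be safe, and it returns only once the
	 * core is quiesced again. */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE);
}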

I still suspect that the "hybrid" mode where you can leave userspace for things
like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
about task migration.  We can refuse to honor a task's request to migrate itself
in that case, perhaps.  I don't know what to think about when someone else tries
to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
fails, when the task is in task isolation mode?  It gets tricky and that's why I
was inclined to go with a simple "it always works, but it produces results
that you have to read the documentation to understand" (i.e. task isolation
mode goes dormant until you schedule back to a task isolation cpu).
On balance this is still the approach that I like best.

Which approach seems best to you?

>> However, this makes me wonder if "strict" mode should be the default
>> for task isolation??  That way task isolation really doesn't conflict
>> semantically with migration.  And we could provide a "weak" mode, or a
>> "kernel-friendly" mode, or some such nomenclature, and define the
>> migration semantics just for that case, where it makes it clear it's a
>> bit unusual.
> Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> Besides, is this something that anyone needs now?

Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
either by commenting out the max_deferment call, or at such time as we have
really fixed the underlying issues and removed the max deferment entirely.

At that point, I'm not sure it's a question of people needing strict mode per se;
I think it's more about picking the mode that is the best from both a user experience
and a quality of implementation perspective.

>>> I think I heard about workloads that need such strict hard isolation.
>>> Workloads that really can not afford any disturbance. They even
>>> use userspace network stack. Maybe HFT?
>> Certainly HFT is one case.
>>
>> A lot of TILE-Gx customers using task isolation (which we call
>> "dataplane" or "Zero-Overhead Linux") are doing high-speed network
>> applications with user-space networking stacks.  It can be DPDK, or it
>> can be another TCP/IP stack (we ship one called tStack) or it
>> could just be an application directly messing with the network
>> hardware from userspace.  These are exactly the applications that led
>> me into this part of kernel development in the first place.
>> Googling "Zero-Overhead Linux" does take you to some discussions
>> of customers that have used this functionality.
> So those workloads couldn't stand an interrupt? Like they would like a signal
> and exit the strict mode if it happens?

Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
be dropped and some kind of logging would fire to report the problem.

> I think that we need to wait for somebody who explicitly requests that feature
> before we work on it, so we can be sure the semantics really agree with someone's
> real load case.

This is really the scenario that Tilera's customers use, so I'm pretty familiar with
what they expect.

>>>> I think we can actually make both modes available to users with just
>>>> another flag bit, so maybe we can look at what that looks like in v11:
>>>> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
>>>> isolation at the next syscall entry, page fault, etc.  Then we can
>>>> think more specifically about whether we want to remove the flag or
>>>> not, and if we remove it, whether we want to make the code that was
>>>> controlled by it unconditionally true or unconditionally false
>>>> (i.e. remove it again).
>>> I think we shouldn't bother with strict hard isolation if we don't need
>>> it yet. The implementation may well be invasive. Let's wait for someone
>>> who really needs it.
>> I'm not sure what part of the patch series you're saying you don't
>> think we need yet.  I'd argue the whole patch series is "hard
>> isolation", and that the "strict" mode introduced in patch 06/13 isn't
>> particularly invasive.
> It's not in the patch series, I'm talking about the strict mode :-)
>
>>> So your requirements are actually hard isolation but in userspace?
>> Yes, exactly.  Were you thinking about a kernel-level hard isolation?
>> That would have some similarities, I guess, but in some ways might
>> actually be a harder problem.
>>
>>> And what happens if you get interrupted in userspace? What about page
>>> faults and other exceptions?
>> See above :-)
>>
>> I hope we're converging here.  If you want to talk live or chat online
>> to help finish converging, perhaps that would make sense?  I'd be
>> happy to take notes and publish a summary of wherever we get to.
>>
>> Thanks for taking the time to review this!
> Ok, so thinking about that talk, I'm wondering if we need some flags
> such as:
>
>           ISOLATION_SIGNAL_SYSCALL
>           ISOLATION_SIGNAL_EXCEPTIONS
>           ISOLATION_SIGNAL_INTERRUPTS
>
> Strict mode would be the three above OR'ed. These are just some random thoughts,
> but they would help define which level of kernel intrusion the user is ready
> to tolerate.
>
> I'm just not sure how granular we want that interface to be.

Yes, you could certainly imagine being more granular.  For example, if you expected
to make syscalls but not receive exceptions or interrupts, that might be a useful
mode.  Or, you were willing to make syscalls and take exceptions, but not receive
interrupts.  (I think you should never be willing to receive asynchronous interrupts,
since that kind of defeats the purpose of task isolation in the first place.)

So maybe something like this:

PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return

It might make sense to say you would allow page faults, for example, but not general
exceptions.  But my guess is that the exception-related stuff really does need an
application use case to account for it.  I would say for the initial support of task
isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
like generating diagnostics on error or slow paths), but not really a model for
understanding why users would want to take exceptions, so I'd say let's omit
that initially, and maybe just add the _ALLOW_SYSCALLS flag.
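
Putting that together, application startup might look roughly like this
(a sketch only; the flag names are the proposals above, and the
mlockall() is the usual recommendation to avoid page faults later):

#include <sys/mman.h>
#include <sys/prctl.h>

static int start_isolation(int allow_syscalls)
{
	int flags = PR_TASK_ISOLATION_ENABLE;

	/* Lock memory down first so page faults don't violate
	 * isolation afterward. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -1;

	if (allow_syscalls)
		flags |= PR_TASK_ISOLATION_ALLOW_SYSCALLS;

	/* Returns only once the core is quiesced. */
	return prctl(PR_SET_TASK_ISOLATION, flags);
}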

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-03 19:32                           ` Chris Metcalf
  (?)
@ 2016-06-29 15:18                           ` Frederic Weisbecker
  2016-07-01 20:59                               ` Chris Metcalf
  -1 siblings, 1 reply; 92+ messages in thread
From: Frederic Weisbecker @ 2016-06-29 15:18 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> >I don't remember how much of this email I answered, but I need to finish that :-)
> 
> Sorry for the slow response - it's been a busy week.

I'm certainly much slower ;-)

> 
> >On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> >>On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >>>On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>>>   TL;DR: Let's make an explicit decision about whether task isolation
> >>>>   should be "persistent" or "one-shot".  Both have some advantages.
> >>>>   =====
> >>>>
> >>>>An important high-level issue is how "sticky" task isolation mode is.
> >>>>We need to choose one of these two options:
> >>>>
> >>>>"Persistent mode": A task switches state to "task isolation" mode
> >>>>(kind of a level-triggered analogy) and stays there indefinitely.  It
> >>>>can make a syscall, take a page fault, etc., if it wants to, but the
> >>>>kernel protects it from incurring any further asynchronous interrupts.
> >>>>This is the model I've been advocating for.
> >>>But then in this mode, what happens when an interrupt triggers.
> >>So what happens if an interrupt does occur?
> >>
> >>In the "base" task isolation mode, you just take the interrupt, then
> >>wait to quiesce any further kernel timer ticks, etc., and return to
> >>the process.  This at least limits the damage to being a single
> >>interruption rather than potentially additional ones, if the interrupt
> >>also caused timers to get queued, etc.
> >Good, although that quiescing on kernel return must be an option.
> 
> Can you spell out why you think turning it off is helpful?  I'll admit
> this is the default mode in the commercial version of task isolation
> that we ship, and was also the default in the first LKML patch series.
> But on consideration I haven't found scenarios where skipping the
> quiescing is helpful.  Admittedly you get out of the kernel faster,
> but then you're back in userspace and vulnerable to yet more
> unexpected interrupts until the timer quiesces.  If you're asking for
> task isolation, this is surely not what you want.

I just feel that quiescing, on the way back to user after an unwanted
interruption, is awkward. The quiescing should work once and for all
on return from the prctl. If we still get disturbed afterward, either
the quiescing is buggy or incomplete, or something is on the way that
cannot be quiesced.

> 
> >>If you enable "strict" mode, we disable task isolation mode for that
> >>core and deliver a signal to it.  This lets the application know that
> >>an interrupt occurred, and it can take whatever kind of logging or
> >>debugging action it wants to, re-enable task isolation if it wants to
> >>and continue, or just exit or abort, etc.
> >Good.
> >
> >>If you don't enable "strict" mode, but you do have
> >>task_isolation_debug enabled as a boot flag, you will at least get a
> >>console dump with a backtrace and whatever other data we have.
> >>(Sometimes the debug info actually includes a backtrace of the
> >>interrupting core, if it's an IPI or TLB flush from another core,
> >>which can be pretty useful.)
> >Right, I suggest we use trace events btw.
> 
> This is probably a good idea, although I wonder if it's worth deferring
> until after the main patch series goes in - I'm reluctant to expand the scope
> of this patch series and add more reasons for it to get delayed :-)
> What do you think?

Yeah, definitely, the patchset is big enough :-)

> 
> >>>>"One-shot mode": A task requests isolation via prctl(), the kernel
> >>>>ensures it is isolated on return from the prctl(), but then as soon as
> >>>>it enters the kernel again, task isolation is switched off until
> >>>>another prctl is issued.  This is what you recommended in your last
> >>>>email.
> >>>No, I think we can issue syscalls, for example. But asynchronous interruptions
> >>>such as exceptions (actually somewhat synchronous but can be unexpected) and
> >>>interrupts are what we want to avoid.
> >>Hmm, so I think I'm not really understanding what you are suggesting.
> >>
> >>We're certainly in agreement that avoiding interrupts and exceptions
> >>is important.  I'm arguing that the way to deal with them is to
> >>generate appropriate signals/printks, etc.
> >Yes.
> >
> >>I'm not actually sure what
> >>you're recommending we do to avoid exceptions.  Since they're
> >>synchronous and deterministic, we can't really avoid them if the
> >>program wants to issue them.  For example, mmap() some anonymous
> >>memory and then start running, and you'll take exceptions each time
> >>you touch a page in that mapped region.  I'd argue it's an application
> >>bug; one should enable "strict" mode to catch and deal with such bugs.
> >They are not all deterministic. For example a breakpoint, a step, a trap
> >can be set up by another process. So this is not entirely under the control
> >of the user.
> 
> That's true, but I'd argue the behavior in that case should be that you can
> raise that kind of exception validly (so you can debug), and then you should
> quiesce on return to userspace so the application doesn't see additional
> exceptions.

I don't see how we can quiesce such things.

> There are two ways you could handle debugging:
> 
> 1. Require the program to set the flag that says it doesn't want a signal
> when it is interrupted (so you can interrupt it to debug it, and not kill it);

That's rather about exceptions, right?

> 
> 2. Or have debugging automatically set that flag in the target process.
> Similarly, we could just say that if a debugger is attached, we never
> generate the kill signal for task isolation.
> 
> >>(Typically the recommendation is to do an mlockall() before starting
> >>task isolation mode, to handle the case of page faults.  But you can
> >>do that and still be screwed by another thread in your process doing a
> >>fork() and then your pages end up read-only for COW and you have to
> >>fault them back in.  But, that's an application bug for a
> >>task-isolation thread, and should just be treated as such.)
> >Now how do you determine which exception is a bug and which is expected?
> >Strict mode should refuse all of them.
> 
> Yes, exactly.  Task isolation will complain about everything. :-)

Ok :-)

> 
> >>>>There are a number of pros and cons to the two models.  I think on
> >>>>balance I still like the "persistent mode" approach, but here's all
> >>>>the pros/cons I can think of:
> >>>>
> >>>>PRO for persistent mode: A somewhat easier programming model.  Users
> >>>>can just imagine "task isolation" as a way for them to still be able
> >>>>to use the kernel exactly as they always have; it's just slower to get
> >>>>back out of the kernel so you use it judiciously. For example, a
> >>>>process is free to call write() on a socket to perform a diagnostic,
> >>>>but when returning from the write() syscall, the kernel will hold the
> >>>>task in kernel mode until any timer ticks (perhaps from networking
> >>>>stuff) are complete, and then let it return to userspace to continue
> >>>>in task isolation mode.
> >>>So this is not hard isolation anymore. This is rather soft isolation with
> >>>best efforts to avoid disturbance.
> >>No, it's still hard isolation.  The distinction is that we offer a way
> >>to get in and out of the kernel "safely" if you want to run in that
> >>mode.  The syscalls can take a long time if the syscall ends up
> >>requiring some additional timer ticks to finish sorting out whatever
> >>it was you asked the kernel to do, but once you're back in userspace
> >>you immediately regain "hard" isolation.  It's under program control.
> >>
> >>Or, you can enable "strict" mode, and then you get hard isolation
> >>without the ability to get in and out of the kernel at all: the kernel
> >>just kills you if you try to leave hard isolation other than by an
> >>explicit prctl().
> >Well, hard isolation is what I would call strict mode.
> 
> Here's what I am inclined towards:
> 
>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.

Ok.

> 
>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>    not expected to occur.

So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?

> 
>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>    It's basically "best effort", just nohz_full plus the code that tries to get things
>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>    "value add" to make this a separate mode, though.

I can imagine HPC workloads wanting this mode.

> 
> >>>Surely we can have different levels of isolation.
> >>Well, we have nohz_full now, and by adding task-isolation, we have
> >>two.  Or three if you count "base" and "strict" mode task isolation as
> >>two separate levels.
> >Right.
> >
> >>>I'm still wondering what to do if the task migrates to another CPU. In fact,
> >>>perhaps what you're trying to do is rather a CPU property than a
> >>>process property?
> >>Well, we did go around on this issue once already (last August) and at
> >>the time you were encouraging isolation to be a "task" property, not a
> >>"cpu" property:
> >>
> >>https://lkml.kernel.org/r/20150812160020.GG21542@lerouge
> >>
> >>You convinced me at the time :-)
> >Indeed :-) Well if it's a task property, we need to handle its affinity properly then.
> >>You're right that migration conflicts with task isolation.  But
> >>certainly, if a task has enabled "strict" semantics, it can't migrate;
> >>it will lose task isolation entirely and get a signal instead,
> >>regardless of whether it calls sched_setaffinity() on itself, or if
> >>someone else changes its affinity and it gets a kick.
> >Yes.
> >
> >>However, if a task doesn't have strict mode enabled, it can call
> >>sched_setaffinity() and force itself onto a non-task_isolation cpu and
> >>it won't get any isolation until it schedules itself back onto a
> >>task_isolation cpu, at which point it wakes up on the new cpu with
> >>hard isolation still in effect.  I can make up reasons why this sort
> >>of thing might be useful, but it's probably a corner case.
> >That doesn't look sane. The user asks the kernel to get away as much
> >as it can but if we are in a non-nohz-full CPU we know we can't provide that
> >service (or rather that non-service).
> >
> >So we would refuse to enter task isolation mode if the task doesn't run on a
> >full dynticks CPU, whereas we accept that it migrates later to a periodic
> >CPU? This isn't consistent.
> 
> Yes, and originally I made that consistent by not checking when it started
> up, either, but I was subsequently convinced that the checks were good for
> sanity.

Sure, sanity checks are good, but if you refuse the prctl by returning an error
on the basis of this sanity condition, the task shouldn't be able to later reach
that insane state without being properly kicked out of the feature provided by
the prctl().

Otherwise perhaps just drop a warning.

> 
> Another answer is just to say that the full strict mode is the only mode, and
> that if the task leaves userspace, it leaves task isolation mode until the mode
> is re-enabled.  In the context of receiving a signal each time, this is more plausible.
> You can always re-enable task isolation in the signal handler if you want.

I would be afraid that, on workloads that can live with a few interrupts, those signals
would be a burden.

> 
> I still suspect that the "hybrid" mode where you can leave userspace for things
> like syscalls, but quiesce on return, is useful.  I agree that it leaves some question
> about task migration.  We can refuse to honor a task's request to migrate itself
> in that case, perhaps.  I don't know what to think about when someone else tries
> to migrate the task - perhaps it only succeeds if the caller is root, and otherwise
> fails, when the task is in task isolation mode?  It gets tricky and that's why I
> was inclined to go with a simple "it always works, but it produces results
> that you have to read the documentation to understand" (i.e. task isolation
> mode goes dormant until you schedule back to a task isolation cpu).
> On balance this is still the approach that I like best.
> 
> Which approach seems best to you?

Indeed, forbidding the task to run on a non-nohz-full CPU would be very tricky.
We would need to take care of all possible races, which has to be done under
the rq lock, so it requires complicating scheduler internals. And eventually,
if the CPU gets offlined, we still need to find the task a place to run.
Moreover, this raises some privilege issues.

That's not really an option, so this leaves two others:

* Make sure that as soon as the task gets scheduled onto a non-nohz-full CPU, it
  loses the flag and gets a signal. That's possible, but again it requires touching
  some scheduler internals.

* Just don't care and schedule the task anywhere; it will be warned soon enough
  about the problem.

The last one looks like a viable and simple enough solution.
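
In sketch form, that last option could just be a check in the scheduler
tick, something like this (the task_isolation_*() helpers are invented
names, only to show the shape of it; tick_nohz_full_cpu() and send_sig()
are the real kernel APIs):

/* Sketch: called from the scheduler tick on each cpu. */
static void task_isolation_check_tick(struct task_struct *p)
{
        if (!task_isolation_enabled(p))
                return;

        if (!tick_nohz_full_cpu(smp_processor_id())) {
                /* Wrong cpu: kick the task out of isolation mode. */
                task_isolation_disable(p);
                send_sig(SIGKILL, p, 1); /* or the configured signal */
        }
}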

> 
> >>However, this makes me wonder if "strict" mode should be the default
> >>for task isolation??  That way task isolation really doesn't conflict
> >>semantically with migration.  And we could provide a "weak" mode, or a
> >>"kernel-friendly" mode, or some such nomenclature, and define the
> >>migration semantics just for that case, where it makes it clear it's a
> >>bit unusual.
> >Well we can't really implement that strict mode until we fix the 1Hz issue, right?
> >Besides, is this something that anyone needs now?
> 
> Certainly all of this is assuming that we have "solved" the 1Hz tick problem,
> either by commenting out the max_deferment call, or at such time as we have
> really fixed the underlying issues and remove the max deferment entirely.
> 
> At that point, I'm not sure it's a question of people needing strict mode per se;
> I think it's more about picking the mode that is the best from both a user experience
> and a quality of implementation perspective.

Sure, ideally we need to start with the mode that people need most and leave room
in the interface for extension.

> 
> >>>I think I heard about workloads that need such strict hard isolation.
> >>>Workloads that really can not afford any disturbance. They even
> >>>use userspace network stack. Maybe HFT?
> >>Certainly HFT is one case.
> >>
> >>A lot of TILE-Gx customers using task isolation (which we call
> >>"dataplane" or "Zero-Overhead Linux") are doing high-speed network
> >>applications with user-space networking stacks.  It can be DPDK, or it
> >>can be another TCP/IP stack (we ship one called tStack) or it
> >>could just be an application directly messing with the network
> >>hardware from userspace.  These are exactly the applications that led
> >>me into this part of kernel development in the first place.
> >>Googling "Zero-Overhead Linux" does take you to some discussions
> >>of customers that have used this functionality.
> >So those workloads couldn't stand an interrupt? Like they would like a signal
> >and exit the strict mode if it happens?
> 
> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
> be dropped and some kind of logging would fire to report the problem.

Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?

> 
> >I think that we need to wait for somebody who explicitly request that feature
> >before we work on it, so we get sure the semantics really agree with someone's
> >real load case.
> 
> This is really the scenario that Tilera's customers use, so I'm pretty familiar with
> what they expect.

Ok, so let's take that direction.

> 
> >Ok, so thinking about that talk, I'm wondering if we need some flags
> >such as:
> >
> >          ISOLATION_SIGNAL_SYSCALL
> >          ISOLATION_SIGNAL_EXCEPTIONS
> >          ISOLATION_SIGNAL_INTERRUPTS
> >
> >Strict mode would be the three above OR'ed. It's just some random thoughts
> >but that would help define which level of kernel intrusion the user is ready
> >to tolerate.
> >
> >I'm just not sure how granular we want that interface to be.
> 
> Yes, you could certainly imagine being more granular.  For example, if you expected
> to make syscalls but not receive exceptions or interrupts, that might be a useful
> mode.  Or, you were willing to make syscalls and take exceptions, but not receive
> interrupts.  (I think you should never be willing to receive asynchronous interrupts,
> since that kind of defeats the purpose of task isolation in the first place.)
> 
> So maybe something like this:
> 
> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
> 
> It might make sense to say you would allow page faults, for example, but not general
> exceptions.  But my guess is that the exception-related stuff really does need an
> application use case to account for it.  I would say for the initial support of task
> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
> like generating diagnostics on error or slow paths), but not really a model for
> understanding why users would want to take exceptions, so I'd say let's omit
> that initially, and maybe just add the _ALLOW_SYSCALLS flag.

Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
does strict pure isolation mode and have future flags for more granularity.

I guess the last thing I'm uncomfortable with is the quiescing that needs to be re-done
every time we get interrupted.

Thanks.

> 
> -- 
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-06-29 15:18                           ` Frederic Weisbecker
@ 2016-07-01 20:59                               ` Chris Metcalf
  0 siblings, 0 replies; 92+ messages in thread
From: Chris Metcalf @ 2016-07-01 20:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
>> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
>>> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>>>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>>>>   TL;DR: Let's make an explicit decision about whether task isolation
>>>>>>   should be "persistent" or "one-shot".  Both have some advantages.
>>>>>>   =====
>>>>>>
>>>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>>>> We need to choose one of these two options:
>>>>>>
>>>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>>>> (kind of a level-triggered analogy) and stays there indefinitely.  It
>>>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>>>> This is the model I've been advocating for.
>>>>> But then in this mode, what happens when an interrupt triggers.
>>>> So what happens if an interrupt does occur?
>>>>
>>>> In the "base" task isolation mode, you just take the interrupt, then
>>>> wait to quiesce any further kernel timer ticks, etc., and return to
>>>> the process.  This at least limits the damage to being a single
>>>> interruption rather than potentially additional ones, if the interrupt
>>>> also caused timers to get queued, etc.
>>> Good, although that quiescing on kernel return must be an option.
>>
>> Can you spell out why you think turning it off is helpful?  I'll admit
>> this is the default mode in the commercial version of task isolation
>> that we ship, and was also the default in the first LKML patch series.
>> But on consideration I haven't found scenarios where skipping the
>> quiescing is helpful.  Admittedly you get out of the kernel faster,
>> but then you're back in userspace and vulnerable to yet more
>> unexpected interrupts until the timer quiesces.  If you're asking for
>> task isolation, this is surely not what you want.
>
> I just feel that quiescing, on the way back to user after an unwanted
> interruption, is awkward. The quiescing should work once and for all
> on return from the prctl. If we still get disturbed afterward, either
> the quiescing is buggy or incomplete, or something is on the way that
> cannot be quiesced.

If we are thinking of an initial implementation that doesn't allow any
subsequent kernel entry to be valid, then this all gets much easier,
since any subsequent kernel entry except for a prctl() syscall will
result in a signal, which will turn off task isolation, and we will
never have to worry about additional quiescing.  I think that's where
we got to in the discussion at the bottom of this email.

So for your question here, we're really just thinking about future
directions as far as how to handle interrupts, and if in the future we
add support for allowing syscalls and/or exceptions without leaving
task isolation mode, then we have to think about how that interacts
with interrupts.  The problem is that it's hard to tell, as you're
returning to userspace, whether you're returning from an exception or
an interrupt; you typically don't have that information available.  So
from a purely ease-of-implementation perspective, we'd likely want to
handle exceptions and interrupts the same way, and quiesce both.

In general, I think it would also be a better explanation to users of
task isolation to say "every enter/exit to the kernel is either an
error that causes a signal, or it quiesces on return".  It's a simpler
semantic, and I think it also is better for interrupts anyway, since
it potentially avoids multiple interrupts to the application (whatever
interrupted to begin with, plus potential timer interrupts later).

But that said, if we start with "pure strict" mode only, all of this
becomes hypothetical, and we may in fact choose never to allow "safe"
modes of entering the kernel.

>>>> I'm not actually sure what
>>>> you're recommending we do to avoid exceptions.  Since they're
>>>> synchronous and deterministic, we can't really avoid them if the
>>>> program wants to issue them.  For example, mmap() some anonymous
>>>> memory and then start running, and you'll take exceptions each time
>>>> you touch a page in that mapped region.  I'd argue it's an application
>>>> bug; one should enable "strict" mode to catch and deal with such bugs.
>>> They are not all deterministic. For example a breakpoint, a step, a trap
>>> can be set up by another process. So this is not entirely under the control
>>> of the user.
>>
>> That's true, but I'd argue the behavior in that case should be that you can
>> raise that kind of exception validly (so you can debug), and then you should
>> quiesce on return to userspace so the application doesn't see additional
>> exceptions.
>
> I don't see how we can quiesce such things.

I'm imagining task A is in dataplane mode, and task B wants to debug
it by writing a breakpoint into its text.  When task A hits the
breakpoint, it will enter the kernel, and hold there while task B
pokes at it with ptrace.  When task A finally is allowed to return to
userspace, it should quiesce before entering userspace in case any
timer interrupts got scheduled (again, maybe due to softirqs or
whatever, or random other kernel activity targeting that core while it
was in the kernel, or whatever).  This is just the same kind of
quiescing we do on return from the initial prctl().

With a "pure strict" mode it does get a little tricky, since we will
end up killing task A as it comes back from its breakpoint.  We might
just choose to say that task A should not enable task isolation if it
is going to be debugged (some runtime switch).  This isn't really a
great solution; I do kind of feel that the nicest thing to do is
quiesce the task again at this point.  This feels like the biggest
argument in favor of supporting a mode where a task-isolated task can
safely enter the kernel for exceptions.  What do you think?

>> There are two ways you could handle debugging:
>>
>> 1. Require the program to set the flag that says it doesn't want a signal
>> when it is interrupted (so you can interrupt it to debug it, and not kill it);
>
> That's rather about exceptions, right?

Yes, with the task A/task B example above, you're right.  I was
thinking there was a kick given by task B to task A.  I think that
might even be true in some circumstances, but anyway, it's a detail.

>> Here's what I am inclined towards:
>>
>>  - Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
>
> Ok.
>
>>
>>  - "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
>>    return, no signals.  But asynchronous interrupts still cause a signal since they are
>>    not expected to occur.
>
> So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?

Yes, correct.

>>  - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
>>    on return to userspace, and asynchronous interrupts don't even cause a signal.
>>    It's basically "best effort", just nohz_full plus the code that tries to get things
>>    like LRU or vmstat to run before returning to userspace.  I think there isn't enough
>>    "value add" to make this a separate mode, though.
>
> I can imagine HPC workloads wanting this mode.

Yes, perhaps.  I'm not convinced we want to target HPC without a much
clearer sense of why this is better than nohz_full, though.  I fear
people might think "task isolation" is better by definition and not
think too much about it, but I'm really not sure it is better for the
HPC use case, necessarily.

>>>> You're right that migration conflicts with task isolation.  But
>>>> certainly, if a task has enabled "strict" semantics, it can't migrate;
>>>> it will lose task isolation entirely and get a signal instead,
>>>> regardless of whether it calls sched_setaffinity() on itself, or if
>>>> someone else changes its affinity and it gets a kick.
>>> Yes.
>>>
>>>> However, if a task doesn't have strict mode enabled, it can call
>>>> sched_setaffinity() and force itself onto a non-task_isolation cpu and
>>>> it won't get any isolation until it schedules itself back onto a
>>>> task_isolation cpu, at which point it wakes up on the new cpu with
>>>> hard isolation still in effect.  I can make up reasons why this sort
>>>> of thing might be useful, but it's probably a corner case.
>>> That doesn't look sane. The user asks the kernel to get away as much
>>> as it can but if we are in a non-nohz-full CPU we know we can't provide that
>>> service (or rather that non-service).
>>>
>>> So we would refuse to enter task isolation mode if the task doesn't run on a
>>> full dynticks CPU, whereas we accept that it migrates later to a periodic
>>> CPU? This isn't consistent.
>>
>> Yes, and originally I made that consistent by not checking when it started
>> up, either, but I was subsequently convinced that the checks were good for
>> sanity.
>
> Sure, sanity checks are good, but if you refuse the prctl by returning an error
> on the basis of this sanity condition, the task shouldn't be able to later reach
> that insane state without being properly kicked out of the feature provided by
> the prctl().
>
> Otherwise perhaps just drop a warning.

Are you saying that we should printk a warning in the prctl() rather
than returning an error in the case where it's not on a full dynticks
cpu?  I could be convinced by that just to keep things consistent.

How about doing it this way?  If you invoke prctl() with the default
"strict" mode where any kernel entry results in a signal, the prctl()
will be strict, and require you to be affinitized to a single, full
dynticks cpu.

But, if you enable the "allow syscalls" mode, then the prctl isn't
strict either, since you can use syscalls to get into a state where
you're not on a full dynticks cpu, and you just get a console warning
if you enter task isolation on the wrong cpu.  (Of course, we may end
up not doing the "allow syscalls" mode for the first version of this
patch anyway, as we discuss below.)
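
Concretely, the prctl-time check might look roughly like this (sketch
only: the helper name is made up, cpus_allowed is the 4.4-era field name,
and the real series may check different conditions; cpumask_weight(),
tick_nohz_full_cpu() and pr_warn() are the real APIs):

/* Sketch of the prctl-time sanity check, ignoring preemption issues. */
static int task_isolation_request(unsigned int flags)
{
        struct task_struct *p = current;
        int cpu = raw_smp_processor_id();

        if (cpumask_weight(&p->cpus_allowed) != 1 ||
            !tick_nohz_full_cpu(cpu)) {
                if (!(flags & PR_TASK_ISOLATION_ALLOW_SYSCALLS))
                        return -EINVAL;  /* strict: refuse outright */
                pr_warn("%s/%d: task isolation on non-nohz_full cpu %d\n",
                        p->comm, p->pid, cpu);
        }

        /* ... set the task flag and quiesce before returning ... */
        return 0;
}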

>>>> Googling "Zero-Overhead Linux" does take you to some discussions
>>>> of customers that have used this functionality.
>>> So those workloads couldn't stand an interrupt? Like they would like a signal
>>> and exit the strict mode if it happens?
>>
>> Correct, they couldn't tolerate interrupts.  If one happened, it would cause packets to
>> be dropped and some kind of logging would fire to report the problem.
>
> Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?

In this mode we don't worry about quiescing for interrupts, since we
are generating a signal, and when you send a signal, you first have to
disable task isolation mode to avoid getting into various bad states
(sending too many signals, or worse, getting deadlocked because you
are signalling the task BECAUSE it was about to receive a signal).  So
we only quiesce after syscalls/exceptions.
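
In other words, roughly this (a sketch with invented helper names;
send_sig() is the real API):

/* Take the task out of isolation before signalling it, so that the
 * signal delivery can't itself trigger yet another signal. */
static void task_isolation_interrupt(struct task_struct *p, int sig)
{
        task_isolation_disable(p);
        send_sig(sig, p, 1);
}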

>> So maybe something like this:
>>
>> PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
>> PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
>> PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
>>
>> It might make sense to say you would allow page faults, for example, but not general
>> exceptions.  But my guess is that the exception-related stuff really does need an
>> application use case to account for it.  I would say for the initial support of task
>> isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
>> like generating diagnostics on error or slow paths), but not really a model for
>> understanding why users would want to take exceptions, so I'd say let's omit
>> that initially, and maybe just add the _ALLOW_SYSCALLS flag.
>
> Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> does strict pure isolation mode and have future flags for more granularity.

I think just implementing the basic _ENABLE mode with pure strict task
isolation makes sense for now.  We can wait to enable syscalls or
exceptions until we have a better use case.  Meanwhile, even without
support for allowing syscalls, you can always use prctl() to turn off
task isolation, and then you can do your syscalls, and prctl() it back
on again.  prctl() to disable task isolation always has to work :-)
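
So the application-side pattern would just be something like this (the
prctl option name and flag value are placeholders for whatever we end up
defining):

#include <unistd.h>
#include <sys/prctl.h>

#define PR_SET_TASK_ISOLATION    48       /* placeholder */
#define PR_TASK_ISOLATION_ENABLE (1 << 0) /* placeholder */

/* Drop out of isolation, do our syscalls, then re-enter (which
 * quiesces again before returning to userspace). */
static void isolated_log(const char *msg, size_t len)
{
        prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
        write(STDERR_FILENO, msg, len);
        prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);
}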

Or, if we want to make it easy to do debugging, and as a result maybe
also support the plausible mode where task-isolation tasks make
occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
above implies syscalls as well, and support that mode.  Perhaps that
makes the most sense...

I'll spin it as a new patch series and you can take a look.

Thanks!
-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-07-01 20:59                               ` Chris Metcalf
@ 2016-07-05 14:41                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc,
	linux-api, linux-kernel

On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return from the prctl. If we still get disturbed afterward, either
> >the quiescing is buggy or incomplete, or something is on the way that
> >cannot be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got to in the discussion at the bottom of this email.

Right.

> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.

Sure, but what I don't understand is why we need to quiesce more than
once (i.e. at the prctl() call). Quiescing should be a single operation
that prevents any further disturbance, like offloading anything we can
to other CPUs. Entering the kernel again shouldn't break that.

> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.

Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.


> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().

Well again I think it shouldn't happen. Quiescing should be done once
and for all.


> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?

Yeah, probably we'll need to introduce some sort of debuggability. Allow
only debug/trap exceptions, for example.


> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC workloads wanting this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.

I don't know. Perhaps HPC could just consist of quiescing once and for all
(offloading everything that can be offloaded) and not signalling when there
is a rare disturbance.

> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.

Yeah that's what I meant.


> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.

But if you do that, you need to do it properly and take care of races against
affinity changes. It involves heavy synchronization against scheduler code.

I tend to think we shouldn't bother with that. If we enter task isolation
mode on a non-nohz-full CPU, the task will be signalled and kicked out of task
isolation mode on the next tick, which happens very soon after the prctl().


> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)

Right.

> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.

Ok. And are you interested in such a strict mode? :-)

If so, it would be nice to start with just that and iterate on top of it.

> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)

Perfect!

> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...

I fear that _ALLOW_EXCEPTIONS is too wide for a special case if all we
want is to allow debugging.

The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far though.


> I'll spin it as a new patch series and you can take a look.

Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.

Thanks.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v9 04/13] task_isolation: add initial support
@ 2016-07-05 14:41                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 92+ messages in thread
From: Frederic Weisbecker @ 2016-07-05 14:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return back from the prctl. If we still get disturbed afterward,
> >either the quiescing is buggy or incomplete, or something is on the
> >way that can not be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got from the discussion at the bottom of this email.

Right.

> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.

Sure but what I don't understand is why do we need to quiesce more than
once (ie: at the prctl() call). Quiescing should be a single operation
that prevents from any further disturbance. Like offlining anything
we to other CPUs. And entering again in the kernel shouldn't break
that.

> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.

Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.


> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().

Well again I think it shouldn't happen. Quiescing should be done once
and for all.


> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?

Yeah, we'll probably need to introduce some sort of debuggability:
allowing debug/trap exceptions only, for example.
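
Something like the following flag layout, say (PR_TASK_ISOLATION_ENABLE
is the bit from this series; the two _ALLOW_* bits are purely
hypothetical sketches of that granularity):

#define PR_TASK_ISOLATION_ENABLE	(1 << 0)	/* from this series */
#define PR_TASK_ISOLATION_ALLOW_DEBUG	(1 << 1)	/* hypothetical */
#define PR_TASK_ISOLATION_ALLOW_SYSCALL	(1 << 2)	/* hypothetical */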


> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC wanting this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.

I don't know. Perhaps the HPC mode could just consist of quiescing once
and for all (offloading everything that can be) and not signalling when
there is a rare disturbance.

> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.

Yeah that's what I meant.


> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.

But if you do that, you need to do it properly and take care of races
against affinity changes. It involves heavy synchronization against
scheduler code.

I tend to think we shouldn't bother with that. If we enter task isolation
mode on a non-nohz_full CPU, the task will be signalled and kicked out of
task isolation mode by the next tick, which happens very soon after the
prctl().
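
If we keep just the warning suggested earlier, the prctl()-time check
could be as simple as the sketch below.  tick_nohz_full_cpu() is the
existing nohz predicate; the helper name and the message are
illustrative, not from the actual series:

#include <linux/sched.h>
#include <linux/tick.h>

static void task_isolation_warn_cpu(void)	/* illustrative name */
{
	int cpu = raw_smp_processor_id();

	if (!tick_nohz_full_cpu(cpu))
		pr_warn("%s/%d: task isolation requested on non-nohz_full CPU %d\n",
			current->comm, current->pid, cpu);
}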


> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)

Right.

> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.

Ok. And are you interested in such a strict mode? :-)

If so, it would be nice to start with just that and iterate on top of it.

> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)

Perfect!
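
For the record, the resulting userspace usage would be roughly the
sketch below.  The prctl() names and the value 48 are the ones this
series uses (they are not in mainline headers), and CPU 3 being
configured as nohz_full is an assumption:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		48	/* value used by this series */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Pin to a single CPU first; assume CPU 3 is nohz_full. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0)) {
		perror("prctl(PR_SET_TASK_ISOLATION)");
		return 1;
	}

	/* ... isolated work: no syscalls, no page faults ... */

	/* Turn isolation off before making a syscall, then back on. */
	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
	printf("made a syscall safely\n");
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

	/* ... more isolated work ... */

	/* Even exit() enters the kernel, so drop isolation first. */
	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
	return 0;
}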

> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...

I fear that _ALLOW_EXCEPTIONS is too broad for a special case if all we
want is to allow debugging.

The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far, though.


> I'll spin it as a new patch series and you can take a look.

Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.

Thanks.


* Re: [PATCH v9 04/13] task_isolation: add initial support
  2016-07-05 14:41                                 ` Frederic Weisbecker
@ 2016-07-05 17:47                                 ` Christoph Lameter
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Lameter @ 2016-07-05 17:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
	Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas,
	Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Tue, 5 Jul 2016, Frederic Weisbecker wrote:

> > >>That's true, but I'd argue the behavior in that case should be that you can
> > >>raise that kind of exception validly (so you can debug), and then you should
> > >>quiesce on return to userspace so the application doesn't see additional
> > >>exceptions.
> > >
> > >I don't see how we can quiesce such things.
> >
> > I'm imagining task A is in dataplane mode, and task B wants to debug
> > it by writing a breakpoint into its text.  When task A hits the
> > breakpoint, it will enter the kernel, and hold there while task B
> > pokes at it with ptrace.  When task A finally is allowed to return to
> > userspace, it should quiesce before entering userspace in case any
> > timer interrupts got scheduled (again, maybe due to softirqs or
> > whatever, or random other kernel activity targeting that core while it
> > was in the kernel, or whatever).  This is just the same kind of
> > quiescing we do on return from the initial prctl().
>
> Well again I think it shouldn't happen. Quiescing should be done once
> and for all.

For debugging, something like that would be helpful. And yes, for the
realtime use cases quiescing is once and for all (until we enter a
different operation mode, if requested by the app).


> > >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> > >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> > >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> > >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> > >>   "value add" to make this a separate mode, though.
> > >
> > >I can imagine HPC wanting this mode.
> >
> > Yes, perhaps.  I'm not convinced we want to target HPC without a much
> > clearer sense of why this is better than nohz_full, though.  I fear
> > people might think "task isolation" is better by definition and not
> > think too much about it, but I'm really not sure it is better for the
> > HPC use case, necessarily.

HPC folks generally like to actually understand what is going on in order
to get the best performance. Just expose the knobs for us, please.



Thread overview: 92+ messages
2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
2016-01-19 15:42   ` Frederic Weisbecker
2016-01-19 20:45     ` Chris Metcalf
2016-01-28  0:28       ` Frederic Weisbecker
2016-01-29 18:18         ` Chris Metcalf
2016-01-30 21:11           ` Frederic Weisbecker
2016-02-11 19:24             ` Chris Metcalf
2016-03-04 12:56               ` Frederic Weisbecker
2016-03-09 19:39                 ` Chris Metcalf
2016-04-08 13:56                   ` Frederic Weisbecker
2016-04-08 16:34                     ` Chris Metcalf
2016-04-12 18:41                       ` Chris Metcalf
2016-04-22 13:16                       ` Frederic Weisbecker
2016-04-25 20:36                         ` Chris Metcalf
2016-05-26  1:07                       ` Frederic Weisbecker
2016-06-03 19:32                         ` Chris Metcalf
2016-06-29 15:18                           ` Frederic Weisbecker
2016-07-01 20:59                             ` Chris Metcalf
2016-07-05 14:41                               ` Frederic Weisbecker
2016-07-05 17:47                                 ` Christoph Lameter
2016-01-04 19:34 ` [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 06/13] task_isolation: add debug boot flag Chris Metcalf
2016-01-04 22:52   ` Steven Rostedt
2016-01-04 23:42     ` Chris Metcalf
2016-01-05 13:42       ` Steven Rostedt
2016-01-04 19:34 ` [PATCH v9 07/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-01-04 21:02   ` [PATCH v9bis " Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 20:33   ` Mark Rutland
2016-01-04 21:01     ` Chris Metcalf
2016-01-05 17:21       ` Mark Rutland
2016-01-05 17:33         ` [PATCH 1/2] arm64: entry: remove pointless SPSR mode check Mark Rutland
2016-01-06 12:15           ` Catalin Marinas
2016-01-05 17:33         ` [PATCH 2/2] arm64: factor work_pending state machine to C Mark Rutland
2016-01-05 18:53           ` Chris Metcalf
2016-01-06 12:30           ` Catalin Marinas
2016-01-06 12:47             ` Mark Rutland
2016-01-06 13:43           ` Mark Rutland
2016-01-06 14:17             ` Catalin Marinas
2016-01-04 22:31     ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Andy Lutomirski
2016-01-05 18:01       ` Mark Rutland
2016-01-04 19:34 ` [PATCH v9 09/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 12/13] arch/tile: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-01-11 21:15 ` [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-12 10:07   ` Will Deacon
2016-01-12 17:49     ` Chris Metcalf
2016-01-13 10:44       ` Ingo Molnar
2016-01-13 21:19         ` Chris Metcalf
2016-01-20 13:27           ` Mark Rutland
2016-01-12 10:53   ` Ingo Molnar
