* [PATCH 0/6] support "dataplane" mode for nohz_full
  From: Chris Metcalf @ 2015-05-08 17:58
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
      Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time: for
example, high-speed networking applications running userspace drivers
that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
the changes do not add any overhead to the usual non-nohz_full mode,
and only very small overhead to the typical nohz_full mode.  A prctl()
option (PR_SET_DATAPLANE) is added to control whether processes have
requested these stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.

Although the current state of the kernel isn't quite ready to run with
absolutely no kernel interrupts (for example, workqueues on dataplane
cores still remain to be dealt with), this patch series provides a way
to make dynamic tradeoffs between avoiding kernel interrupts on the
one hand, and making voluntary calls in and out of the kernel more
expensive on the other, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2, in
turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (6):
  nohz_full: add support for "dataplane" mode
  nohz: dataplane: allow tick to be fully disabled for dataplane
  dataplane nohz: run softirqs synchronously on user entry
  nohz: support PR_DATAPLANE_QUIESCE
  nohz: support PR_DATAPLANE_STRICT mode
  nohz: add dataplane_debug boot flag

 Documentation/kernel-parameters.txt |   6 ++
 arch/tile/mm/homecache.c            |   5 +-
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |  12 ++++
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |   3 +
 kernel/irq_work.c                   |   4 +-
 kernel/sched/core.c                 |  18 ++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |  15 ++++-
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            | 112 +++++++++++++++++++++++++++++++++++-
 13 files changed, 198 insertions(+), 5 deletions(-)

--
2.1.2
* [PATCH 1/6] nohz_full: add support for "dataplane" mode
  From: Chris Metcalf @ 2015-05-08 17:58
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
      Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the kernel
entry/exit path, to provide 100% kernel semantics, etc.  However, some
applications require a stronger commitment from the kernel to avoid
interruptions, in particular userspace device driver style
applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect to
have the stronger semantics as needed, specifying
prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE) to do so.  Subsequent
commits will add additional flags and additional semantics.

The dataplane state is indicated by setting a new task struct field,
dataplane_flags, to the value passed by prctl().  When the _ENABLE bit
is set for a task, and it is returning to userspace on a nohz_full
core, it calls the new tick_nohz_dataplane_enter() routine to take
additional actions to help the task avoid being interrupted in the
future.  For this first patch, the only action taken is to call
lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++++
 include/uapi/linux/prctl.h |  5 +++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 ++++++++
 kernel/time/tick-sched.c   | 13 +++++++++++++
 6 files changed, 42 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..3680aa07c9ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long task_state_change;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int dataplane_flags;
+#endif
 };

 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..d191cda9b71a 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>

 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
 	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }

+static inline bool tick_nohz_is_dataplane(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->dataplane_flags & PR_DATAPLANE_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_dataplane_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_dataplane(void) { return false; }
+static inline void tick_nohz_dataplane_enter(void) { }
 #endif

 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1aa8fa8a8b05 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */

+/* Enable/disable or query dataplane mode for NO_HZ_FULL kernels. */
+#define PR_SET_DATAPLANE	47
+#define PR_GET_DATAPLANE	48
+# define PR_DATAPLANE_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..dd6bdd6197b6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>

 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
 		 * on the tick.
 		 */
 		if (state == CONTEXT_USER) {
+			if (tick_nohz_is_dataplane())
+				tick_nohz_dataplane_enter();
 			trace_user_enter(0);
 			vtime_user_enter(current);
 		}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..930b750aefde 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_DATAPLANE:
+		me->dataplane_flags = arg2;
+		break;
+	case PR_GET_DATAPLANE:
+		error = me->dataplane_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..31c674719647 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>

 #include <asm/irq_regs.h>

@@ -389,6 +390,18 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * When returning to userspace on a nohz_full core after doing
+ * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and try
+ * more aggressively to prevent this core from being interrupted later.
+ */
+void tick_nohz_dataplane_enter(void)
+{
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+}
+
 #endif

--
2.1.2
* [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  From: Chris Metcalf @ 2015-05-08 17:58
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
      Srivatsa S. Bhat, linux-kernel
  Cc: Chris Metcalf

While the current fallback to 1-second tick is still helpful for
maintaining completely correct kernel semantics, processes using
prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
completely tickless, so don't bound the time_delta for such processes.

This was previously discussed in

  https://lkml.org/lkml/2014/10/31/364

and Thomas Gleixner observed that vruntime, load balancing data, load
accounting, and other things might be impacted.  Frederic Weisbecker
similarly observed that allowing the tick to be indefinitely deferred
just meant that no one would ever fix the underlying bugs.  However,
it is at least true that the mode proposed in this patch can only be
enabled on an isolcpus core, which may limit how important it is to
maintain scheduler data correctly, for example.

It's also worth observing that the tile architecture has been using
similar code for its Zero-Overhead Linux for many years (starting in
2005), and customers are very enthusiastic about the resulting
bare-metal performance on cores that are available to run full Linux
semantics on demand (crash, logging, shutdown, etc).  So these
semantics are very useful if we can convince ourselves that doing this
is safe.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 31c674719647..25fdd6bdd1eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -644,7 +644,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 	}

 #ifdef CONFIG_NO_HZ_FULL
-	if (!ts->inidle) {
+	if (!ts->inidle && !tick_nohz_is_dataplane()) {
 		time_delta = min(time_delta,
 				 scheduler_tick_max_deferment());
 	}

--
2.1.2
* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  From: Peter Zijlstra @ 2015-05-12 9:26
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
      Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
      Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-kernel

On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> While the current fallback to 1-second tick is still helpful for
> maintaining completely correct kernel semantics, processes using
> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> completely tickless, so don't bound the time_delta for such processes.
>
> This was previously discussed in
>
>   https://lkml.org/lkml/2014/10/31/364
>
> and Thomas Gleixner observed that vruntime, load balancing data,
> load accounting, and other things might be impacted.  Frederic
> Weisbecker similarly observed that allowing the tick to be indefinitely
> deferred just meant that no one would ever fix the underlying bugs.
> However it's at least true that the mode proposed in this patch can
> only be enabled on an isolcpus core, which may limit how important
> it is to maintain scheduler data correctly, for example.

So how is making this available going to help people fix the actual
problem?

There is nothing fundamentally impossible about fixing this proper,
it's just a lot of hard work.

NAK on this, do it right.
* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  From: Paul E. McKenney @ 2015-05-12 13:12
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-kernel

On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
> > While the current fallback to 1-second tick is still helpful for
> > maintaining completely correct kernel semantics, processes using
> > prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
> > completely tickless, so don't bound the time_delta for such processes.
[...]
> So how is making this available going to help people fix the actual
> problem?

It will at least provide an environment where adding more of this
problem might get punished.  This would be an improvement over what we
have today, namely that the 1HZ fallback timer silently forgives
adding more problems of this sort.

							Thanx, Paul

> There is nothing fundamentally impossible about fixing this proper,
> it's just a lot of hard work.
>
> NAK on this, do it right.
* Re: [PATCH 2/6] nohz: dataplane: allow tick to be fully disabled for dataplane
  From: Chris Metcalf @ 2015-05-14 20:55
  To: paulmck, Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
      Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
      Christoph Lameter, linux-kernel

On 05/12/2015 09:12 AM, Paul E. McKenney wrote:
> On Tue, May 12, 2015 at 11:26:07AM +0200, Peter Zijlstra wrote:
>> On Fri, May 08, 2015 at 01:58:43PM -0400, Chris Metcalf wrote:
>>> While the current fallback to 1-second tick is still helpful for
>>> maintaining completely correct kernel semantics, processes using
>>> prctl(PR_SET_DATAPLANE) semantics place a higher priority on running
>>> completely tickless, so don't bound the time_delta for such processes.
[...]
>> So how is making this available going to help people fix the actual
>> problem?
> It will at least provide an environment where adding more of this
> problem might get punished.  This would be an improvement over what
> we have today, namely that the 1HZ fallback timer silently forgives
> adding more problems of this sort.

So I guess the obvious question to ask is whether there is a mode that
can be dynamically enabled (/proc/sys/kernel/nohz_experimental or
whatever) where we allow turning off this tick: perhaps to make it
more likely that tick-dependent code isn't added to the kernel, as
Paul suggests, or perhaps to enable applications that want to avoid
the tick's conservatism and are willing to do sufficient QA that they
are comfortable exploring possible issues with the 1Hz tick being
disabled?

Paul, PeterZ, any thoughts on something along these lines?  Or another
suggestion?

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  From: Chris Metcalf @ 2015-05-08 17:58
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-kernel
  Cc: Chris Metcalf

For tasks which have elected dataplane functionality, we run any
pending softirqs for the core before returning to userspace, rather
than ever scheduling ksoftirqd to run.  The problem we fix is that by
allowing another task to run on the core, we guarantee more interrupts
in the future to the dataplane task, which is exactly what dataplane
mode is required to prevent.

This may be an alternate approach to what Mike Galbraith recently
proposed in e.g.:

  https://lkml.org/lkml/2015/3/13/11

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 kernel/softirq.c         | 14 +++++++++++++-
 kernel/time/tick-sched.c | 26 +++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 479e4436f787..bc9406337f82 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -291,6 +291,15 @@ restart:
 		    --max_restart)
 			goto restart;

+		/*
+		 * For dataplane tasks, waking ksoftirqd because the
+		 * softirqs are slow is a bad idea; we would rather
+		 * synchronously finish whatever is interrupting us,
+		 * and then be able to cleanly enter dataplane mode.
+		 */
+		if (tick_nohz_is_dataplane())
+			goto restart;
+
 		wakeup_softirqd();
 	}

@@ -410,8 +419,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
 	 *
 	 * Otherwise we wake up ksoftirqd to make sure we
 	 * schedule the softirq soon.
+	 *
+	 * For dataplane tasks, we will handle the softirq
+	 * synchronously on return to userspace.
 	 */
-	if (!in_interrupt())
+	if (!in_interrupt() && !tick_nohz_is_dataplane())
 		wakeup_softirqd();
 }

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 25fdd6bdd1eb..fd0e6e5c931c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -398,8 +398,26 @@ void __init tick_nohz_init(void)
  */
 void tick_nohz_dataplane_enter(void)
 {
+	/*
+	 * Check for softirqs as close as possible to our return to
+	 * userspace, and run any that are waiting.  We need to ensure
+	 * that we can safely avoid running softirqd, which will cause
+	 * interrupts for nohz_full tasks.  Note that interrupts may
+	 * be enabled internally by do_softirq().
+	 */
+	do_softirq();
+
 	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
 	lru_add_drain();
+
+	/*
+	 * Disable interrupts again since other code running in this
+	 * function may have enabled them, and the caller expects
+	 * interrupts to be disabled on return.  Enabling them during
+	 * this call is safe since the caller is not assuming any
+	 * state that might have been altered by an interrupt.
+	 */
+	local_irq_disable();
 }
 #endif

@@ -771,7 +789,13 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 	if (need_resched())
 		return false;

-	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
+	/*
+	 * If we are running dataplane for this process, don't worry
+	 * about pending softirqs; we will force them to run
+	 * synchronously before returning to userspace.
+	 */
+	if (unlikely(local_softirq_pending() && cpu_online(cpu) &&
+		     !tick_nohz_is_dataplane())) {
 		static int ratelimit;

 		if (ratelimit < 10 &&

--
2.1.2
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry
  From: Mike Galbraith @ 2015-05-09 7:04
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-kernel

On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote:
> For tasks which have elected dataplane functionality, we run
> any pending softirqs for the core before returning to userspace,
> rather than ever scheduling ksoftirqd to run.  The problem we
> fix is that by allowing another task to run on the core, we
> guarantee more interrupts in the future to the dataplane task,
> which is exactly what dataplane mode is required to prevent.

If ksoftirqd were rt class, softirqs would be gone when the soloist
gets the CPU back and heads to userspace.  Being a soloist, it has no
use for a priority, so why can't it just let ksoftirqd run if it
raises the occasional softirq?  Meeting a contended lock while
processing it will wreck the soloist regardless of who does that
processing.

	-Mike
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry 2015-05-09 7:04 ` Mike Galbraith @ 2015-05-11 20:13 ` Chris Metcalf 2015-05-12 2:21 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-11 20:13 UTC (permalink / raw) To: Mike Galbraith Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, linux-kernel On 05/09/2015 03:04 AM, Mike Galbraith wrote: > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote: >> For tasks which have elected dataplane functionality, we run >> any pending softirqs for the core before returning to userspace, >> rather than ever scheduling ksoftirqd to run. The problem we >> fix is that by allowing another task to run on the core, we >> guarantee more interrupts in the future to the dataplane task, >> which is exactly what dataplane mode is required to prevent. > If ksoftirqd were rt class I realize I actually don't know if this is true or not. Is ksoftirqd rt class? If not, it does seem pretty plausible that it should be... > softirqs would be gone when the soloist gets > the CPU back and heads to userspace. Being a soloist, it has no use for > a priority, so why can't it just let ksoftirqd run if it raises the > occasional softirq? Meeting a contended lock while processing it will > wreck the soloist regardless of who does that processing. The thing you want to avoid is having two processes both runnable at once, since then the "quiesce" mode can't make forward progress and basically spins in cpu_idle() until ksoftirqd can come in. Alas, my recollection of the precise failure mode is somewhat dimmed; my commit notes from a year ago (for a variant of the patch I'm upstreaming now): - Trying to return to userspace with pending softirqs is not currently allowed. Prior to this patch, when this happened we would just wait in cpu_idle. 
Instead, what we now do is directly run any pending softirqs, then go back and retry the path where we return to userspace. - Raising softirqs (in this case for hrtimer support) could cause the ksoftirqd daemon to be woken on a core. This is bad because on a dataplane core, a QUIESCE process will then block until the ksoftirqd runs, and the system sometimes seems to flag that soft irqs are available but not schedule the timer to arrange for a context switch to ksoftirqd. To handle this, we avoid bailing out in __do_softirq() when we've been working for a while, if we're on a dataplane core, and just keep working until done. Similarly, on a dataplane core running a userspace task, we don't wake ksoftirqd when we are raising a softirq, even if we're not in an interrupt context where it will run promptly, since a non-interrupt context will also run promptly. I'm happy to drop this patch entirely from the series for now, and if ksoftirqd shows up as a problem going forward, we can address it as necessary at that time. What do you think? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry 2015-05-11 20:13 ` Chris Metcalf @ 2015-05-12 2:21 ` Mike Galbraith 2015-05-12 9:28 ` Peter Zijlstra 2015-05-12 9:32 ` Peter Zijlstra 2 siblings, 0 replies; 340+ messages in thread From: Mike Galbraith @ 2015-05-12 2:21 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, linux-kernel On Mon, 2015-05-11 at 16:13 -0400, Chris Metcalf wrote: > On 05/09/2015 03:04 AM, Mike Galbraith wrote: > > On Fri, 2015-05-08 at 13:58 -0400, Chris Metcalf wrote: > >> For tasks which have elected dataplane functionality, we run > >> any pending softirqs for the core before returning to userspace, > >> rather than ever scheduling ksoftirqd to run. The problem we > >> fix is that by allowing another task to run on the core, we > >> guarantee more interrupts in the future to the dataplane task, > >> which is exactly what dataplane mode is required to prevent. > > If ksoftirqd were rt class > > I realize I actually don't know if this is true or not. Is > ksoftirqd rt class? If not, it does seem pretty plausible that > it should be... It is in an rt kernel, not in a stock kernel, it's malleable in both ;-) > > softirqs would be gone when the soloist gets > > the CPU back and heads to userspace. Being a soloist, it has no use for > > a priority, so why can't it just let ksoftirqd run if it raises the > > occasional softirq? Meeting a contended lock while processing it will > > wreck the soloist regardless of who does that processing. > > The thing you want to avoid is having two processes both > runnable at once, since then the "quiesce" mode can't make > forward progress and basically spins in cpu_idle() until ksoftirqd > can come in. The only way ksoftirqd can appear is the soloist woke it. 
If alleged soloist is raising enough softirqs to matter, it ain't really an ultra sensitive solo artist, it's part of a noise inducing (locks) chorus. > Alas, my recollection of the precise failure mode > is somewhat dimmed; my commit notes from a year ago (for > a variant of the patch I'm upstreaming now): > > - Trying to return to userspace with pending softirqs is not > currently allowed. Prior to this patch, when this happened > we would just wait in cpu_idle. Instead, what we now do is > directly run any pending softirqs, then go back and retry the > path where we return to userspace. > > - Raising softirqs (in this case for hrtimer support) could > cause the ksoftirqd daemon to be woken on a core. This is > bad because on a dataplane core, a QUIESCE process will > then block until the ksoftirqd runs, and the system sometimes > seems to flag that soft irqs are available but not schedule > the timer to arrange for a context switch to ksoftirqd. > To handle this, we avoid bailing out in __do_softirq() when > we've been working for a while, if we're on a dataplane core, > and just keep working until done. Similarly, on a dataplane > core running a userspace task, we don't wake ksoftirqd when > we are raising a softirq, even if we're not in an interrupt > context where it will run promptly, since a non-interrupt > context will also run promptly. Thomas has nuked the hrtimer softirq. > I'm happy to drop this patch entirely from the series for now, and > if ksoftirqd shows up as a problem going forward, we can address it > as necessary at that time. What do you think? Inlining softirqs may save a context switch, but adds cycles that we may consume at higher frequency than the thing we're avoiding. -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry 2015-05-11 20:13 ` Chris Metcalf 2015-05-12 2:21 ` Mike Galbraith @ 2015-05-12 9:28 ` Peter Zijlstra 2015-05-12 9:32 ` Peter Zijlstra 2 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:28 UTC (permalink / raw) To: Chris Metcalf Cc: Mike Galbraith, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, linux-kernel On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote: > - Raising softirqs (in this case for hrtimer support) could Note that Thomas recently killed all the softirq wreckage in hrtimers. So that specific case is dealt with. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry 2015-05-11 20:13 ` Chris Metcalf 2015-05-12 2:21 ` Mike Galbraith 2015-05-12 9:28 ` Peter Zijlstra @ 2015-05-12 9:32 ` Peter Zijlstra 2015-05-12 13:08 ` Paul E. McKenney 2 siblings, 1 reply; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:32 UTC (permalink / raw) To: Chris Metcalf Cc: Mike Galbraith, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, linux-kernel On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote: > The thing you want to avoid is having two processes both > runnable at once Right, because as soon as nr_running > 1 we kill the entire nohz_full thing. RT or not for ksoftirqd doesn't matter. Then again, like interrupts, you basically want to avoid softirqs in this mode. So I think the right solution is to figure out why the softirqs get raised and cure that. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 3/6] dataplane nohz: run softirqs synchronously on user entry 2015-05-12 9:32 ` Peter Zijlstra @ 2015-05-12 13:08 ` Paul E. McKenney 0 siblings, 0 replies; 340+ messages in thread From: Paul E. McKenney @ 2015-05-12 13:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Mike Galbraith, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Christoph Lameter, linux-kernel On Tue, May 12, 2015 at 11:32:02AM +0200, Peter Zijlstra wrote: > On Mon, May 11, 2015 at 04:13:16PM -0400, Chris Metcalf wrote: > > The thing you want to avoid is having two processes both > > runnable at once > > Right, because as soon as nr_running > 1 we kill the entire nohz_full > thing. RT or not for ksoftirqd doesn't matter. > > Then again, like interrupts, you basically want to avoid softirqs in > this mode. > > So I think the right solution is to figure out why the softirqs get > raised and cure that. Makes sense, but it also makes sense to have something that detects when that cure fails and clean up. And, in a test/debug environment, also issuing some sort of diagnostic in that case. Thanx, Paul ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE
  2015-05-08 17:58 ` Chris Metcalf
@ 2015-05-08 17:58 ` Chris Metcalf
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw)
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney, Christoph Lameter,
	Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf

This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the
kernel to quiesce any pending timer interrupts prior to returning to
userspace.  When running with this mode set, system calls (and page
faults, etc.) can be inordinately slow.  However, user applications
that want to guarantee that no unexpected interrupts will occur (even
if they call into the kernel) can set this flag to guarantee those
semantics.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 include/uapi/linux/prctl.h |  1 +
 kernel/time/tick-sched.c   | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 1aa8fa8a8b05..8b735651304a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_DATAPLANE	47
 #define PR_GET_DATAPLANE	48
 # define PR_DATAPLANE_ENABLE	(1 << 0)
+# define PR_DATAPLANE_QUIESCE	(1 << 1)

 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fd0e6e5c931c..69d908c6cef8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -392,6 +392,53 @@ void __init tick_nohz_init(void)
 }

 /*
+ * We normally return immediately to userspace.
+ *
+ * The PR_DATAPLANE_QUIESCE flag causes us to wait until no more
+ * interrupts are pending.  Otherwise we nap with interrupts enabled
+ * and wait for the next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two processes on the same core and both
+ * specify PR_DATAPLANE_QUIESCE, neither will ever leave the kernel,
+ * and one will have to be killed manually.  Otherwise in situations
+ * where another process is in the runqueue on this cpu, this task
+ * will just wait for that other task to go idle before returning to
+ * user space.
+ */
+static void dataplane_quiesce(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: dataplane task blocked for %ld jiffies\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start));
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+
+		/* Idle with interrupts enabled and wait for the tick. */
+		set_current_state(TASK_INTERRUPTIBLE);
+		arch_cpu_idle();
+		set_current_state(TASK_RUNNING);
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: dataplane task unblocked after %ld jiffies\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start));
+		dump_stack();
+	}
+}
+
+/*
  * When returning to userspace on a nohz_full core after doing
  * prctl(PR_SET_DATAPLANE, PR_DATAPLANE_ENABLE), we come here and
  * try more aggressively to prevent this core from being interrupted
  * later.
@@ -411,6 +458,13 @@ void tick_nohz_dataplane_enter(void)
 	lru_add_drain();

 	/*
+	 * Quiesce any timer ticks if requested.  On return from this
+	 * function, no timer ticks are pending.
+	 */
+	if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0)
+		dataplane_quiesce();
+
+	/*
 	 * Disable interrupts again since other code running in this
 	 * function may have enabled them, and the caller expects
 	 * interrupts to be disabled on return.  Enabling them during
-- 
2.1.2

^ permalink raw reply related	[flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE @ 2015-05-12 9:33 ` Peter Zijlstra 0 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:33 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > kernel to quiesce any pending timer interrupts prior to returning > to userspace. When running with this mode set, sys calls (and page > faults, etc.) can be inordinately slow. However, user applications > that want to guarantee that no unexpected interrupts will occur > (even if they call into the kernel) can set this flag to guarantee > that semantics. Currently people hot-unplug and hot-plug the CPU to do this. Obviously that's a wee bit horrible :-) Not sure if a prctl like this is any better though. This is a CPU property, not a process one. ISTR people talking about a 'quiesce' sysfs file, alongside the hotplug stuff, I can't quite remember. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE @ 2015-05-12 9:50 ` Ingo Molnar 0 siblings, 0 replies; 340+ messages in thread From: Ingo Molnar @ 2015-05-12 9:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > > kernel to quiesce any pending timer interrupts prior to returning > > to userspace. When running with this mode set, sys calls (and page > > faults, etc.) can be inordinately slow. However, user applications > > that want to guarantee that no unexpected interrupts will occur > > (even if they call into the kernel) can set this flag to guarantee > > that semantics. > > Currently people hot-unplug and hot-plug the CPU to do this. > Obviously that's a wee bit horrible :-) > > Not sure if a prctl like this is any better though. This is a CPU > properly not a process one. So if then a prctl() (or other system call) could be a shortcut to: - move the task to an isolated CPU - make sure there _is_ such an isolated domain available I.e. have some programmatic, kernel provided way for an application to be sure it's running in the right environment. Relying on random administration flags here and there won't cut it. Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE @ 2015-05-12 10:38 ` Peter Zijlstra 0 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 10:38 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 11:50:30AM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: > > > This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the > > > kernel to quiesce any pending timer interrupts prior to returning > > > to userspace. When running with this mode set, sys calls (and page > > > faults, etc.) can be inordinately slow. However, user applications > > > that want to guarantee that no unexpected interrupts will occur > > > (even if they call into the kernel) can set this flag to guarantee > > > that semantics. > > > > Currently people hot-unplug and hot-plug the CPU to do this. > > Obviously that's a wee bit horrible :-) > > > > Not sure if a prctl like this is any better though. This is a CPU > > properly not a process one. > > So if then a prctl() (or other system call) could be a shortcut to: > > - move the task to an isolated CPU > - make sure there _is_ such an isolated domain available > > I.e. have some programmatic, kernel provided way for an application to > be sure it's running in the right environment. Relying on random > administration flags here and there won't cut it. No, we already have sched_setaffinity() and we should not duplicate its ability to move tasks about. What this is about is 'clearing' CPU state, its nothing to do with tasks. Ideally we'd never have to clear the state because it should be impossible to get into this predicament in the first place. 
The typical example here is a periodic timer that found its way onto the cpu and stays there. We're actually working on allowing such self arming timers to migrate, so once we have that sorted this could be fixed proper I think. Not sure if there's more pollution that people worry about. The hotplug hack worked because unplug force migrates the timers away. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 10:38 ` Peter Zijlstra (?) @ 2015-05-12 12:52 ` Ingo Molnar 2015-05-13 4:35 ` Andy Lutomirski -1 siblings, 1 reply; 340+ messages in thread From: Ingo Molnar @ 2015-05-12 12:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > > So if then a prctl() (or other system call) could be a shortcut > > to: > > > > - move the task to an isolated CPU > > - make sure there _is_ such an isolated domain available > > > > I.e. have some programmatic, kernel provided way for an > > application to be sure it's running in the right environment. > > Relying on random administration flags here and there won't cut > > it. > > No, we already have sched_setaffinity() and we should not duplicate > its ability to move tasks about. But sched_setaffinity() does not guarantee isolation - it's just a syscall to move a task to a set of CPUs, which might be isolated or not. What I suggested is that it might make sense to offer a system call, for example a sched_setparam() variant, that makes such guarantees. Say if user-space does: ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); ... then we would get the task moved to an isolated domain and get a 0 return code if the kernel is able to do all that and if the current uid/namespace/etc. has the required permissions and such. ( BIND_ISOLATED will not replace the current p->policy value, so it's still possible to use the regular policies as well on top of this. ) I.e. make it programmatic instead of relying on a fragile, kernel-version-dependent combination of sysctl, sysfs, kernel config and boot parameter details to get us this result. I.e.
provide a central hub to offer this feature in a more structured, easier to use fashion. We might still require the admin (or distro) to separately set up the domain of isolated CPUs, and it would still be possible to simply 'move' tasks there using existing syscalls - but I say that it's not a bad idea at all to offer a single central syscall interface for apps to request such treatment. > What this is about is 'clearing' CPU state, its nothing to do with > tasks. > > Ideally we'd never have to clear the state because it should be > impossible to get into this predicament in the first place. That I absolutely agree about, that bit is nonsense. We might offer debugging facilities to debug such bugs, but we won't work or hack it around. Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 12:52 ` Ingo Molnar @ 2015-05-13 4:35 ` Andy Lutomirski 2015-05-13 17:51 ` Paul E. McKenney 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-05-13 4:35 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API, linux-kernel On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > >> > So if then a prctl() (or other system call) could be a shortcut >> > to: >> > >> > - move the task to an isolated CPU >> > - make sure there _is_ such an isolated domain available >> > >> > I.e. have some programmatic, kernel provided way for an >> > application to be sure it's running in the right environment. >> > Relying on random administration flags here and there won't cut >> > it. >> >> No, we already have sched_setaffinity() and we should not duplicate >> its ability to move tasks about. > > But sched_setaffinity() does not guarantee isolation - it's just a > syscall to move a task to a set of CPUs, which might be isolated or > not. > > What I suggested is that it might make sense to offer a system call, > for example a sched_setparam() variant, that makes such guarantees. > > Say if user-space does: > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > ... then we would get the task moved to an isolated domain and get a 0 > return code if the kernel is able to do all that and if the current > uid/namespace/etc. has the required permissions and such. > > ( BIND_ISOLATED will not replace the current p->policy value, so it's > still possible to use the regular policies as well on top of this. ) I think we shouldn't have magic selection of an isolated domain. 
Anyone using this has already configured some isolated CPUs and probably wants to choose the CPU and, especially, NUMA node themselves. Also, maybe it should be a special type of realtime class/priority -- doing this should require RT permission IMO. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-13 4:35 ` Andy Lutomirski @ 2015-05-13 17:51 ` Paul E. McKenney 2015-05-14 20:55 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Paul E. McKenney @ 2015-05-13 17:51 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Peter Zijlstra, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API, linux-kernel On Tue, May 12, 2015 at 09:35:25PM -0700, Andy Lutomirski wrote: > On Tue, May 12, 2015 at 5:52 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > >> > So if then a prctl() (or other system call) could be a shortcut > >> > to: > >> > > >> > - move the task to an isolated CPU > >> > - make sure there _is_ such an isolated domain available > >> > > >> > I.e. have some programmatic, kernel provided way for an > >> > application to be sure it's running in the right environment. > >> > Relying on random administration flags here and there won't cut > >> > it. > >> > >> No, we already have sched_setaffinity() and we should not duplicate > >> its ability to move tasks about. > > > > But sched_setaffinity() does not guarantee isolation - it's just a > > syscall to move a task to a set of CPUs, which might be isolated or > > not. > > > > What I suggested is that it might make sense to offer a system call, > > for example a sched_setparam() variant, that makes such guarantees. > > > > Say if user-space does: > > > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > > > ... then we would get the task moved to an isolated domain and get a 0 > > return code if the kernel is able to do all that and if the current > > uid/namespace/etc. has the required permissions and such. 
> > > > ( BIND_ISOLATED will not replace the current p->policy value, so it's > > still possible to use the regular policies as well on top of this. ) > > I think we shouldn't have magic selection of an isolated domain. > Anyone using this has already configured some isolated CPUs and > probably wants to choose the CPU and, especially, NUMA node > themselves. Also, maybe it should be a special type of realtime > class/priority -- doing this should require RT permission IMO. I have no real argument against special permissions, but this feature is totally orthogonal to realtime classes/priorities. It is perfectly legitimate for a given CPU's single runnable task to be SCHED_OTHER, for example. Thanx, Paul ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE @ 2015-05-14 20:55 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-14 20:55 UTC (permalink / raw) To: paulmck, Andy Lutomirski Cc: Ingo Molnar, Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, linux-doc, Linux API, linux-kernel On 05/12/2015 08:52 AM, Ingo Molnar wrote: > What I suggested is that it might make sense to offer a system call, > for example a sched_setparam() variant, that makes such guarantees. > > Say if user-space does: > > ret = sched_setscheduler(0, BIND_ISOLATED, &isolation_params); > > ... then we would get the task moved to an isolated domain and get a 0 > return code if the kernel is able to do all that and if the current > uid/namespace/etc. has the required permissions and such. Unfortunately I don't know nearly as much about the scheduler and scheduler policies as I might, since I mostly focused on making the scheduler stay out of the way. :-) This does seem like another way to set a policy bit on a process. I assume you could only validly issue this call on a nohz_full core, and that you're not assuming it migrates the task to such a core? You suggested that BIND_ISOLATED would not replace the usual scheduler policies, but perhaps SCHED_ISOLATED as a full replacement would make sense - it would make it an error to have any other schedulable task on that core. I guess that brings it around to whether the "cpu_isolated" task just loses when another task is scheduled on the core with it (the current approach I'm proposing) or if it ends up truly owning the core and other processes can be denied the right to run there: which in that case clearly does get us into the area of requiring privileges to set up, as Andy pointed out later. 
This would leave the notion of "strict" as proposed elsewhere as a separate thing, but presumably it could still be a prctl() as originally proposed. I admit I don't know enough to say whether this sounds like a better approach than just using a prctl() to set the cpu_isolated state. My instinct is that it's cleanest to avoid requiring permissions to do this, and to simply enable the quiescing semantics the process requested when it happens to be alone on a core. If so, it's somewhat orthogonal to the actual scheduler policy in force, so best not to conflate it with the notion of scheduler code at all via sched_setscheduler()? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 4/6] nohz: support PR_DATAPLANE_QUIESCE 2015-05-12 9:33 ` Peter Zijlstra @ 2015-05-14 20:54 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-14 20:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On 05/12/2015 05:33 AM, Peter Zijlstra wrote: > On Fri, May 08, 2015 at 01:58:45PM -0400, Chris Metcalf wrote: >> This prctl() flag for PR_SET_DATAPLANE sets a mode that requires the >> kernel to quiesce any pending timer interrupts prior to returning >> to userspace. When running with this mode set, sys calls (and page >> faults, etc.) can be inordinately slow. However, user applications >> that want to guarantee that no unexpected interrupts will occur >> (even if they call into the kernel) can set this flag to guarantee >> that semantics. > Currently people hot-unplug and hot-plug the CPU to do this. Obviously > that's a wee bit horrible :-) > > Not sure if a prctl like this is any better though. This is a CPU > property, not a process one. The CPU property aspects, I think, should be largely handled by fixing kernel bugs that let work end up running on nohz_full cores without having been explicitly requested to run there. As you said in a follow-up email: On 05/12/2015 06:38 AM, Peter Zijlstra wrote: > Ideally we'd never have to clear the state because it should be > impossible to get into this predicament in the first place. What my prctl() proposal does is quiesce things that end up happening specifically because the user process called on purpose into the kernel. For example, perhaps RCU was invoked in the kernel, and the core has to wait a timer tick to quiesce RCU. 
Whatever causes it, the intent is that you're not allowed back into userspace until everything has settled down from your call into the kernel; the presumption is that it's all due to the kernel entry that was just made, and not from other stray work. In that sense, it's very appropriate for it to be a process property. > ISTR people talking about 'quiesce' sysfs file, along side the hotplug > stuff, I can't quite remember. It seems somewhat similar (adding Viresh to the cc's) but does seem like it might have been more intended to address the CPU properties rather than process properties: https://lkml.org/lkml/2014/4/4/99 One thing the original Tilera dataplane code did was to require setting dataplane flags to succeed only on dataplane cores, and only when the task had been affinitized to that single core. This did not protect the task from later being re-affinitized in a way that broke those assumptions, but I suppose you could also imagine making sched_setaffinity() fail for such a process. Somewhat unrelated, but it occurred to me in the context of this reply, so what do you think? I can certainly add this to the patch series if it seems like it makes setting the prctl() flags more conservative. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 ` Chris Metcalf @ 2015-05-08 17:58 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With QUIESCE mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->dataplane_flags that is set when prctl() sets the flags. That way, when we are exiting the kernel after calling prctl() to forbid future kernel exits, we don't get immediately killed. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/sys.c | 2 +- kernel/time/tick-sched.c | 17 +++++++++++++++++ 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 8b735651304a..9cf79aa1e73f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_DATAPLANE 48 # define PR_DATAPLANE_ENABLE (1 << 0) # define PR_DATAPLANE_QUIESCE (1 << 1) +# define PR_DATAPLANE_STRICT (1 << 2) +# define PR_DATAPLANE_PRCTL (1U << 31) /* kernel internal */ #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 930b750aefde..8102433c9edd 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2245,7 +2245,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, break; #ifdef CONFIG_NO_HZ_FULL case PR_SET_DATAPLANE: - me->dataplane_flags = arg2; + me->dataplane_flags = arg2 | PR_DATAPLANE_PRCTL; break; case PR_GET_DATAPLANE: error = me->dataplane_flags; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 69d908c6cef8..22ed0decb363 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) (jiffies - start)); dump_stack(); } + + /* + * Kill the process if it violates STRICT mode. Note that this + * code also results in killing the task if a kernel bug causes an + * irq to be delivered to this core. + */ + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) + == PR_DATAPLANE_STRICT) { + pr_warn("Dataplane STRICT mode violated; process killed.\n"); + dump_stack(); + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; + local_irq_enable(); + do_group_exit(SIGKILL); + } } /* @@ -464,6 +478,9 @@ void tick_nohz_dataplane_enter(void) if ((current->dataplane_flags & PR_DATAPLANE_QUIESCE) != 0) dataplane_quiesce(); + /* Clear the bit set by prctl() when it updates the flags. 
*/ + current->dataplane_flags &= ~PR_DATAPLANE_PRCTL; + /* * Disable interrupts again since other code running in this * function may have enabled them, and the caller expects -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 ` Chris Metcalf (?) @ 2015-05-09 7:28 ` Andy Lutomirski 2015-05-09 10:37 ` Gilad Ben Yossef 2015-05-11 19:13 ` Chris Metcalf -1 siblings, 2 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-05-09 7:28 UTC (permalink / raw) To: Chris Metcalf Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > With QUIESCE mode, the task is in principle guaranteed not to be > interrupted by the kernel, but only if it behaves. In particular, > if it enters the kernel via system call, page fault, or any of > a number of other synchronous traps, it may be unexpectedly > exposed to long latencies. Add a simple flag that puts the process > into a state where any such kernel entry is fatal. > > To allow the state to be entered and exited, we add an internal > bit to current->dataplane_flags that is set when prctl() sets the > flags. That way, when we are exiting the kernel after calling > prctl() to forbid future kernel exits, we don't get immediately > killed. Is there any reason this can't already be addressed in userspace using /proc/interrupts or perf_events? ISTM the real goal here is to detect when we screw up and fail to avoid an interrupt, and killing the task seems like overkill to me. Also, can we please stop further torturing the exit paths? We have a disaster of assembly code that calls into syscall_trace_leave and do_notify_resume. Those functions, in turn, *both* call user_enter (WTF?), and on very brief inspection user_enter makes it into the nohz code through multiple levels of indirection, which, with these patches, has yet another conditionally enabled helper, which does this new stuff. 
It's getting to be impossible to tell what happens when we exit to user space any more. Also, I think your code is buggy. There's no particular guarantee that user_enter is only called once between sys_prctl and the final exit to user mode (see the above WTF), so you might spuriously kill the process. Also, I think that most users will be quite surprised if "strict dataplane" code causes any machine check on the system to kill your dataplane task. Similarly, a user accidentally running perf record -a probably should have some reasonable semantics. /proc/interrupts gets that right as is. Sure, MCEs will hurt your RT performance, but Intel screwed up the way that MCEs work, so we should make do. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-09 7:28 ` Andy Lutomirski @ 2015-05-09 10:37 ` Gilad Ben Yossef 2015-05-11 19:13 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Gilad Ben Yossef @ 2015-05-09 10:37 UTC (permalink / raw) To: Andy Lutomirski, Chris Metcalf Cc: Srivatsa S. Bhat, Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Linux API > From: Andy Lutomirski [mailto:luto@amacapital.net] > Sent: Saturday, May 09, 2015 10:29 AM > To: Chris Metcalf > Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar; > Rik van Riel; linux-doc@vger.kernel.org; Andrew Morton; linux- > kernel@vger.kernel.org; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven > Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API > Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode > > On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > > > With QUIESCE mode, the task is in principle guaranteed not to be > > interrupted by the kernel, but only if it behaves. In particular, > > if it enters the kernel via system call, page fault, or any of > > a number of other synchronous traps, it may be unexpectedly > > exposed to long latencies. Add a simple flag that puts the process > > into a state where any such kernel entry is fatal. > > > > To allow the state to be entered and exited, we add an internal > > bit to current->dataplane_flags that is set when prctl() sets the > > flags. That way, when we are exiting the kernel after calling > > prctl() to forbid future kernel exits, we don't get immediately > > killed. > > Is there any reason this can't already be addressed in userspace using > /proc/interrupts or perf_events? 
ISTM the real goal here is to detect > when we screw up and fail to avoid an interrupt, and killing the task > seems like overkill to me. > > Also, can we please stop further torturing the exit paths? So, I don't know if it is a practical suggestion or not, but would it be better/easier to mark a pending signal on kernel entry for this case? The upsides I see are that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully since return to userspace with a pending signal is already handled we don't need new code in the exit path? Gilad ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode @ 2015-05-11 19:13 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-11 19:13 UTC (permalink / raw) To: Andy Lutomirski Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On 05/09/2015 03:28 AM, Andy Lutomirski wrote: > On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: >> With QUIESCE mode, the task is in principle guaranteed not to be >> interrupted by the kernel, but only if it behaves. In particular, >> if it enters the kernel via system call, page fault, or any of >> a number of other synchronous traps, it may be unexpectedly >> exposed to long latencies. Add a simple flag that puts the process >> into a state where any such kernel entry is fatal. >> >> To allow the state to be entered and exited, we add an internal >> bit to current->dataplane_flags that is set when prctl() sets the >> flags. That way, when we are exiting the kernel after calling >> prctl() to forbid future kernel exits, we don't get immediately >> killed. > Is there any reason this can't already be addressed in userspace using > /proc/interrupts or perf_events? ISTM the real goal here is to detect > when we screw up and fail to avoid an interrupt, and killing the task > seems like overkill to me. Patch 6/6 proposes a mechanism to track down times when the kernel screws up and delivers an IRQ to a userspace-only task. Here, we're just trying to identify the times when an application screws itself up out of cluelessness, and provide a mechanism that allows the developer to easily figure out why and fix it. In particular, /proc/interrupts won't show syscalls or page faults, which are two easy ways applications can screw themselves when they think they're in userspace-only mode. 
Also, they don't provide sufficient precision to make it clear what part of the application caused the undesired kernel entry. In this case, killing the task is appropriate, since that's exactly the semantics that have been asked for - it's like on architectures that don't natively support unaligned accesses, but fake it relatively slowly in the kernel, and in development you just say "give me a SIGBUS when that happens" and in production you might say "fix it up and let's try to keep going". You can argue that this is something that can be done by ftrace, but certainly you'd want to have a way to programmatically turn on ftrace at the moment when you're entering userspace-only mode, so we'd want some API around that anyway. And honestly, it's so easy to test a task state bit in a couple of places and generate the failure on the spot, vs. the relative complexity of setting up and understanding ftrace, that I think it merits inclusion on that basis alone. > Also, can we please stop further torturing the exit paths? We have a > disaster of assembly code that calls into syscall_trace_leave and > do_notify_resume. Those functions, in turn, *both* call user_enter > (WTF?), and on very brief inspection user_enter makes it into the nohz > code through multiple levels of indirection, which, with these > patches, has yet another conditionally enabled helper, which does this > new stuff. It's getting to be impossible to tell what happens when we > exit to user space any more. > > Also, I think your code is buggy. There's no particular guarantee > that user_enter is only called once between sys_prctl and the final > exit to user mode (see the above WTF), so you might spuriously kill > the process. This is a good point; I also find the x86 kernel entry and exit paths confusing, although I've reviewed them a bunch of times. The tile architecture paths are a little easier to understand. 
That said, I think the answer here is to avoid non-idempotent actions in the dataplane code, such as clearing a syscall bit. A better implementation, I think, is to put the tests for "you screwed up and synchronously entered the kernel" in the syscall_trace_enter() code, which TIF_NOHZ already gets us into; there, we can test whether the dataplane "strict" bit is set and the syscall is not prctl(); if so, we generate the error. (We'd exclude exit and exit_group here too, since we don't need to shoot down a task that's just trying to kill itself.) This needs a bit of platform-specific code for each platform, but that doesn't seem like too big a problem. Likewise we can test in exception_enter(), since that's called only for synchronous user entries like page faults. > Also, I think that most users will be quite surprised if "strict > dataplane" code causes any machine check on the system to kill your > dataplane task. Fair point, and avoided by testing as described above instead. (Though presumably in development it's not such a big deal, and as I said you'd likely turn it off in production.) > Similarly, a user accidentally running perf record -a > probably should have some reasonable semantics. Yes, also avoided by doing this as above, though I'd argue we could also just say that running perf disables this mode. But it's not as clean as the above suggestion. On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote: > So, I don't know if it is a practical suggestion or not, but would it better/easier to mark a pending signal on kernel entry for this case? > The upsides I see is that the user gets her notification (killing the task or just logging the event in a signal handler) and hopefully since return to userspace with a pending signal is already handled we don't need new code in the exit path? We could certainly do this now that I'm planning to do the test at kernel entry rather than super-late in kernel exit. 
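The entry-path check proposed above could look roughly like the following. This is a hypothetical sketch only, not code from the series: the dataplane_flags field and PR_DATAPLANE_STRICT bit follow the naming proposed in this patch set, while dataplane_strict_syscall() is an invented helper name.

```c
/*
 * Hypothetical sketch of the strict-mode test described above, as it
 * might be called from syscall_trace_enter(). Not the actual patch:
 * the dataplane_flags field and PR_DATAPLANE_STRICT bit follow the
 * naming in this series; the helper name is invented here.
 */
static void dataplane_strict_syscall(struct task_struct *tsk, int syscall_nr)
{
	if (!(tsk->dataplane_flags & PR_DATAPLANE_STRICT))
		return;

	/* prctl() must stay usable so the task can leave strict mode. */
	if (syscall_nr == __NR_prctl)
		return;

	/* No need to shoot down a task that is just trying to exit. */
	if (syscall_nr == __NR_exit || syscall_nr == __NR_exit_group)
		return;

	/*
	 * Raise a real SIGKILL rather than calling do_group_exit()
	 * directly, so a debugger can catch it and inspect the pc.
	 */
	send_sig(SIGKILL, tsk, 1);
}
```

Since the check runs only on the already-slow traced-syscall path that TIF_NOHZ forces, it adds no overhead to untraced tasks.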
Rather than just do_group_exit(SIGKILL), we should raise a proper SIGKILL signal via send_sig(SIGKILL, current, 1), and then we could catch it in the debugger; the pc should help identify if it was a syscall, page fault, or other trap. I'm not sure there's an argument to be made for the user process being able to catch the signal itself; presumably in production you don't turn this mode on anyway, and in development, assuming a debugger is probably fine. But if you want to argue for another signal (SIGILL?) please do; I'm curious to hear if you think it would make more sense. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-11 19:13 ` Chris Metcalf (?) @ 2015-05-11 22:28 ` Andy Lutomirski 2015-05-12 21:06 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-05-11 22:28 UTC (permalink / raw) To: Chris Metcalf, Peter Zijlstra Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner, Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API [add peterz due to perf stuff] On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 05/09/2015 03:28 AM, Andy Lutomirski wrote: >> >> On May 8, 2015 11:44 PM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: >>> >>> With QUIESCE mode, the task is in principle guaranteed not to be >>> interrupted by the kernel, but only if it behaves. In particular, >>> if it enters the kernel via system call, page fault, or any of >>> a number of other synchronous traps, it may be unexpectedly >>> exposed to long latencies. Add a simple flag that puts the process >>> into a state where any such kernel entry is fatal. >>> >>> To allow the state to be entered and exited, we add an internal >>> bit to current->dataplane_flags that is set when prctl() sets the >>> flags. That way, when we are exiting the kernel after calling >>> prctl() to forbid future kernel exits, we don't get immediately >>> killed. >> >> Is there any reason this can't already be addressed in userspace using >> /proc/interrupts or perf_events? ISTM the real goal here is to detect >> when we screw up and fail to avoid an interrupt, and killing the task >> seems like overkill to me. > > > Patch 6/6 proposes a mechanism to track down times when the > kernel screws up and delivers an IRQ to a userspace-only task. 
> Here, we're just trying to identify the times when an application > screws itself up out of cluelessness, and provide a mechanism > that allows the developer to easily figure out why and fix it. > > In particular, /proc/interrupts won't show syscalls or page faults, > which are two easy ways applications can screw themselves > when they think they're in userspace-only mode. Also, they don't > provide sufficient precision to make it clear what part of the > application caused the undesired kernel entry. Perf does, though, complete with context. > > In this case, killing the task is appropriate, since that's exactly > the semantics that have been asked for - it's like on architectures > that don't natively support unaligned accesses, but fake it relatively > slowly in the kernel, and in development you just say "give me a > SIGBUS when that happens" and in production you might say > "fix it up and let's try to keep going". I think more control is needed. I also think that, if we go this route, we should distinguish syscalls, synchronous non-syscall entries, and asynchronous non-syscall entries. They're quite different. > > You can argue that this is something that can be done by ftrace, > but certainly you'd want to have a way to programmatically > turn on ftrace at the moment when you're entering userspace-only > mode, so we'd want some API around that anyway. And honestly, > it's so easy to test a task state bit in a couple of places and > generate the failurel on the spot, vs. the relative complexity > of setting up and understanding ftrace, that I think it merits > inclusion on that basis alone. perf_event, not ftrace. > >> Also, can we please stop further torturing the exit paths? We have a >> disaster of assembly code that calls into syscall_trace_leave and >> do_notify_resume. 
Those functions, in turn, *both* call user_enter >> (WTF?), and on very brief inspection user_enter makes it into the nohz >> code through multiple levels of indirection, which, with these >> patches, has yet another conditionally enabled helper, which does this >> new stuff. It's getting to be impossible to tell what happens when we >> exit to user space any more. >> >> Also, I think your code is buggy. There's no particular guarantee >> that user_enter is only called once between sys_prctl and the final >> exit to user mode (see the above WTF), so you might spuriously kill >> the process. > > > This is a good point; I also find the x86 kernel entry and exit > paths confusing, although I've reviewed them a bunch of times. > The tile architecture paths are a little easier to understand. > > That said, I think the answer here is avoid non-idempotent > actions in the dataplane code, such as clearing a syscall bit. > > A better implementation, I think, is to put the tests for "you > screwed up and synchronously entered the kernel" in > the syscall_trace_enter() code, which TIF_NOHZ already > gets us into; No, not unless you're planning on using that to distinguish syscalls from other stuff *and* people think that's justified. It's far too easy to just make a tiny change to the entry code. Add a tiny trivial change here, a few lines of asm (that's you, audit!) there, some weird written-in-asm scheduling code over here, and you end up with the truly awful mess that we currently have. If it really makes sense for this stuff to go with context tracking, then fine, but we should *fix* the context tracking first rather than kludging around it. I already have a prototype patch for the relevant part of that. > there, we can test if the dataplane "strict" bit is > set and the syscall is not prctl(), then we generate the error. > (We'd exclude exit and exit_group here too, since we don't > need to shoot down a task that's just trying to kill itself.) 
> This needs a bit of platform-specific code for each platform, > but that doesn't seem like too big a problem. I'd rather avoid that, too. This feature isn't really arch-specific, so let's avoid the arch stuff if at all possible. > > Likewise we can test in exception_enter() since that's only > called for all the synchronous user entries like page faults. Let's try to generalize a bit. There's also irq_entry and ist_enter, and some of the exception_enter cases are for synchronous entries while (IIRC -- could be wrong) others aren't always like that. > >> Also, I think that most users will be quite surprised if "strict >> dataplane" code causes any machine check on the system to kill your >> dataplane task. > > > Fair point, and avoided by testing as described above instead. > (Though presumably in development it's not such a big deal, > and as I said you'd likely turn it off in production.) Until you forget to turn it off in production because it worked so nicely in development. What if we added a mode to perf where delivery of a sample synchronously (or semi-synchronously by catching it on the next exit to userspace) freezes the delivering task? It would be like debugger support via perf. peterz, do you think this would be a sensible thing to add to perf? It would only make sense for some types of events (tracepoints and hw_breakpoints mostly, I think). >> So, I don't know if it is a practical suggestion or not, but would it >> better/easier to mark a pending signal on kernel entry for this case? >> The upsides I see is that the user gets her notification (killing the task >> or just logging the event in a signal handler) and hopefully since return to >> userspace with a pending signal is already handled we don't need new code in >> the exit path? > > > We could certainly do this now that I'm planning to do the > test at kernel entry rather than super-late in kernel exit. 
> Rather than just do_group_exit(SIGKILL), we should raise > a proper SIGKILL signal via send_sig(SIGKILL, current, 1), > and then we could catch it in the debugger; the pc should > help identify if it was a syscall, page fault, or other trap. > > I'm not sure there's an argument to be made for the user > process being able to catch the signal itself; presumably in > production you don't turn this mode on anyway, and in > development, assuming a debugger is probably fine. > > But if you want to argue for another signal (SIGILL?) please > do; I'm curious to hear if you think it would make more sense. Make it configurable as part of the prctl. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-11 22:28 ` Andy Lutomirski @ 2015-05-12 21:06 ` Chris Metcalf 2015-05-12 22:23 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-05-12 21:06 UTC (permalink / raw) To: Andy Lutomirski, Peter Zijlstra Cc: Paul E. McKenney, Frederic Weisbecker, Ingo Molnar, Rik van Riel, linux-doc, Andrew Morton, linux-kernel, Thomas Gleixner, Tejun Heo, Steven Rostedt, Christoph Lameter, Gilad Ben Yossef, Linux API On 05/11/2015 06:28 PM, Andy Lutomirski wrote: > [add peterz due to perf stuff] > > On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Patch 6/6 proposes a mechanism to track down times when the >> kernel screws up and delivers an IRQ to a userspace-only task. >> Here, we're just trying to identify the times when an application >> screws itself up out of cluelessness, and provide a mechanism >> that allows the developer to easily figure out why and fix it. >> >> In particular, /proc/interrupts won't show syscalls or page faults, >> which are two easy ways applications can screw themselves >> when they think they're in userspace-only mode. Also, they don't >> provide sufficient precision to make it clear what part of the >> application caused the undesired kernel entry. > Perf does, though, complete with context. The perf_event suggestions are interesting, but I think it's plausible for this to be an alternate way to debug the issues that STRICT addresses. >> In this case, killing the task is appropriate, since that's exactly >> the semantics that have been asked for - it's like on architectures >> that don't natively support unaligned accesses, but fake it relatively >> slowly in the kernel, and in development you just say "give me a >> SIGBUS when that happens" and in production you might say >> "fix it up and let's try to keep going". > I think more control is needed. 
I also think that, if we go this > route, we should distinguish syscalls, synchronous non-syscall > entries, and asynchronous non-syscall entries. They're quite > different. I don't think it's necessary to distinguish the types. As long as we have a PC pointing to the instruction that triggered the problem, we can see if it's a system call instruction, a memory write that caused a page fault, a trap instruction, etc. We certainly could add infrastructure to capture syscall numbers, fault/signal numbers, etc etc, but I think it's overkill if it adds kernel overhead on entry/exit. >> A better implementation, I think, is to put the tests for "you >> screwed up and synchronously entered the kernel" in >> the syscall_trace_enter() code, which TIF_NOHZ already >> gets us into; > No, not unless you're planning on using that to distinguish syscalls > from other stuff *and* people think that's justified. So, the question is how we separate synchronous entries from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated tasks), and synchronous entries are application bugs. We'd like to deliver a signal for the latter, and do some kind of kernel diagnostics for the former. So we can't just add the test in the context tracking code, which doesn't actually know why we're entering or exiting. That's why I was thinking that the syscall_trace_entry and exception_enter paths were the best choices. I'm fairly sure that exception_enter is only done for synchronous traps, page faults, etc. Certainly on the tile architecture we include the trap number in the pt_regs, so it's possible to just examine the pt_regs and know why you entered or are exiting the kernel, but I don't think we can rely on that for all architectures. > It's far to easy to just make a tiny change to the entry code. Add a > tiny trivial change here, a few lines of asm (that's you, audit!) 
> there, some weird written-in-asm scheduling code over here, and you > end up with the truly awful mess that we currently have. > > If it really makes sense for this stuff to go with context tracking, > then fine, but we should *fix* the context tracking first rather than > kludging around it. I already have a prototype patch for the relevant > part of that. > >> there, we can test if the dataplane "strict" bit is >> set and the syscall is not prctl(), then we generate the error. >> (We'd exclude exit and exit_group here too, since we don't >> need to shoot down a task that's just trying to kill itself.) >> This needs a bit of platform-specific code for each platform, >> but that doesn't seem like too big a problem. > I'd rather avoid that, too. This feature isn't really arch-specific, > so let's avoid the arch stuff if at all possible. I'll put out a v2 of my patch that does both the things you advise against :-) just so we can have a strawman to think about how to do it better - unless you have a suggestion offhand as to how we can better differentiate sync and async entries into the kernel in a platform-independent way. I could imagine modifying user_exit() and exception_enter() to pass an identifier into the context system saying why they were changing contexts, so we could have syscalls, trap numbers, fault numbers, etc., and some way to query as to whether they were synchronous or asynchronous, and build this scheme on top of that, but I'm not sure the extra infrastructure is worthwhile. >> Likewise we can test in exception_enter() since that's only >> called for all the synchronous user entries like page faults. > Let's try to generalize a bit. There's also irq_entry and ist_enter, > and some of the exception_enter cases are for synchronous entries > while (IIRC -- could be wrong) others aren't always like that. I don't think we need to generalize this piece. irq_entry() shouldn't be reported by the STRICT mechanism but by kernel bug reporting. 
For ist_enter(), it looks like if you're coming from userspace it's just handled with exception_enter(). I'm more familiar with the tile architecture mechanisms than with x86, though, to be honest. >>> Also, I think that most users will be quite surprised if "strict >>> dataplane" code causes any machine check on the system to kill your >>> dataplane task. >> >> Fair point, and avoided by testing as described above instead. >> (Though presumably in development it's not such a big deal, >> and as I said you'd likely turn it off in production.) > Until you forget to turn it off in production because it worked so > nicely in development. I guess that's an argument for using a non-fatal signal with a handler from the get-go, since then even in production you'll just end up with a slightly heavier-weight kernel overhead (whatever stupid thing your application did, plus the time spent in the signal handler), but then after that you can get back to processing packets or whatever the app is doing. You had mentioned some alternatives to a catchable signal (a signal to some other process, or queuing to an fd); I think it still seems reasonable to just deliver a signal to the process, configurably by the prctl, and not do anything more complex. Does this seem reasonable to you at this point? > What if we added a mode to perf where delivery of a sample > synchronously (or semi-synchronously by catching it on the next exit > to userspace) freezes the delivering task? It would be like debugger > support via perf. > > peterz, do you think this would be a sensible thing to add to perf? > It would only make sense for some types of events (tracepoints and > hw_breakpoints mostly, I think). I suspect it's reasonable to consider this orthogonal, particularly if there is some skid between the actual violation by the application, and the freeze happening. 
You pushed back somewhat on prctl() in favor of a quiesce() syscall in your email, but it seemed like at the end of your email you were adopting the prctl() perspective. Is that true? I admit the prctl() still seems cleaner from my perspective. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-12 21:06 ` Chris Metcalf @ 2015-05-12 22:23 ` Andy Lutomirski 2015-05-15 21:25 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-05-12 22:23 UTC (permalink / raw) To: Chris Metcalf Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > On 05/11/2015 06:28 PM, Andy Lutomirski wrote: >> >> [add peterz due to perf stuff] >> >> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >>> >>> Patch 6/6 proposes a mechanism to track down times when the >>> kernel screws up and delivers an IRQ to a userspace-only task. >>> Here, we're just trying to identify the times when an application >>> screws itself up out of cluelessness, and provide a mechanism >>> that allows the developer to easily figure out why and fix it. >>> >>> In particular, /proc/interrupts won't show syscalls or page faults, >>> which are two easy ways applications can screw themselves >>> when they think they're in userspace-only mode. Also, they don't >>> provide sufficient precision to make it clear what part of the >>> application caused the undesired kernel entry. >> >> Perf does, though, complete with context. > > > The perf_event suggestions are interesting, but I think it's plausible > for this to be an alternate way to debug the issues that STRICT > addresses. 
> > >>> In this case, killing the task is appropriate, since that's exactly >>> the semantics that have been asked for - it's like on architectures >>> that don't natively support unaligned accesses, but fake it relatively >>> slowly in the kernel, and in development you just say "give me a >>> SIGBUS when that happens" and in production you might say >>> "fix it up and let's try to keep going". >> >> I think more control is needed. I also think that, if we go this >> route, we should distinguish syscalls, synchronous non-syscall >> entries, and asynchronous non-syscall entries. They're quite >> different. > > > I don't think it's necessary to distinguish the types. As long as we > have a PC pointing to the instruction that triggered the problem, > we can see if it's a system call instruction, a memory write that > caused a page fault, a trap instruction, etc. Not true. PC right after a syscall insn could be any type of kernel entry, and you can't even reliably tell whether the syscall insn was executed or, on x86, whether it was a syscall at all. (x86 insns can't be reliably decoded backwards.) PC pointing at a load could be a page fault or an IPI. > We certainly could > add infrastructure to capture syscall numbers, fault/signal numbers, > etc etc, but I think it's overkill if it adds kernel overhead on > entry/exit. > None of these should add overhead. > >>> A better implementation, I think, is to put the tests for "you >>> screwed up and synchronously entered the kernel" in >>> the syscall_trace_enter() code, which TIF_NOHZ already >>> gets us into; >> >> No, not unless you're planning on using that to distinguish syscalls >> from other stuff *and* people think that's justified. > > > So, the question is how we separate synchronous entries > from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated > tasks), and synchronous entries are application bugs. 
We'd > like to deliver a signal for the latter, and do some kind of > kernel diagnostics for the former. So we can't just add the > test in the context tracking code, which doesn't actually know > why we're entering or exiting. Synchronous entries could be VM bugs, too. > > That's why I was thinking that the syscall_trace_entry and > exception_enter paths were the best choices. I'm fairly sure > that exception_enter is only done for synchronous traps, > page faults, etc. Maybe. Doing it through the actual entry/exit slow paths would be overhead-free, although I'm not sure that IRQs have real slow paths for entry. > > Certainly on the tile architecture we include the trap number > in the pt_regs, so it's possible to just examine the pt_regs and > know why you entered or are exiting the kernel, but I don't > think we can rely on that for all architectures. x86 can't do this. > I'll put out a v2 of my patch that does both the things you > advise against :-) just so we can have a strawman to think > about how to do it better - unless you have a suggestion > offhand as to how we can better differentiate sync and async > entries into the kernel in a platform-independent way. > > I could imagine modifying user_exit() and exception_enter() > to pass an identifier into the context system saying why they > were changing contexts, so we could have syscalls, trap > numbers, fault numbers, etc., and some way to query as > to whether they were synchronous or asynchronous, and > build this scheme on top of that, but I'm not sure the extra > infrastructure is worthwhile. > I'll take a look. Again, though, I think we really do need to distinguish at least MCE and NMI (on x86) from the others. > >> What if we added a mode to perf where delivery of a sample >> synchronously (or semi-synchronously by catching it on the next exit >> to userspace) freezes the delivering task? It would be like debugger >> support via perf. 
>> >> peterz, do you think this would be a sensible thing to add to perf? >> It would only make sense for some types of events (tracepoints and >> hw_breakpoints mostly, I think). > > > I suspect it's reasonable to consider this orthogonal, particularly > if there is some skid between the actual violation by the > application, and the freeze happening. > I think it could be done without skid, except for async entries, but for async entries we don't care about exact user state anyway. > You pushed back somewhat on prctl() in favor of a quiesce() > syscall in your email, but it seemed like at the end of your > email you were adopting the prctl() perspective. Is that true? > I admit the prctl() still seems cleaner from my perspective. > Prctl for the strict thing seems much more reasonable to me than prctl for quiescing. Also, the scheduler people seem to think that quiescing should be automatic. Anyway, I'll happily look at code and maybe even write more coherent emails when I'm back in town in a week. Since you're thinking that async entries should give kernel diagnostics instead of signals, maybe the right thing to do is to separate them out completely and try to address the individual entry types separately and as needed. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-12 22:23 ` Andy Lutomirski @ 2015-05-15 21:25 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On 05/12/2015 06:23 PM, Andy Lutomirski wrote: > On May 13, 2015 6:06 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: >> On 05/11/2015 06:28 PM, Andy Lutomirski wrote: >>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>> In this case, killing the task is appropriate, since that's exactly >>>> the semantics that have been asked for - it's like on architectures >>>> that don't natively support unaligned accesses, but fake it relatively >>>> slowly in the kernel, and in development you just say "give me a >>>> SIGBUS when that happens" and in production you might say >>>> "fix it up and let's try to keep going". >>> I think more control is needed. I also think that, if we go this >>> route, we should distinguish syscalls, synchronous non-syscall >>> entries, and asynchronous non-syscall entries. They're quite >>> different. >> >> I don't think it's necessary to distinguish the types. As long as we >> have a PC pointing to the instruction that triggered the problem, >> we can see if it's a system call instruction, a memory write that >> caused a page fault, a trap instruction, etc. > Not true. PC right after a syscall insn could be any type of kernel > entry, and you can't even reliably tell whether the syscall insn was > executed or, on x86, whether it was a syscall at all. (x86 insns > can't be reliably decoded backwards.) > > PC pointing at a load could be a page fault or an IPI. All that we are trying to do with this API, though, is distinguish synchronous faults.
So IPIs, etc., should not be happening (they would be bugs), and hopefully we are mostly just distinguishing different types of synchronous program entries. That said, I did add a si_info flag to differentiate syscalls from other synchronous entries, and I'm open to looking at more such if it seems useful. > Again, though, I think we really do need to distinguish at least MCE > and NMI (on x86) from the others. Yes, those are both interesting cases, and I'm not entirely sure what the right way to handle them is - for example, likely disable STRICT if you are running with perf enabled. I look forward to hearing more when you're back next week! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode 2015-05-08 17:58 ` Chris Metcalf (?) (?) @ 2015-05-12 9:38 ` Peter Zijlstra 2015-05-12 13:20 ` Paul E. McKenney -1 siblings, 1 reply; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 9:38 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote: > +++ b/kernel/time/tick-sched.c > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) > (jiffies - start)); > dump_stack(); > } > + > + /* > + * Kill the process if it violates STRICT mode. Note that this > + * code also results in killing the task if a kernel bug causes an > + * irq to be delivered to this core. > + */ > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) > + == PR_DATAPLANE_STRICT) { > + pr_warn("Dataplane STRICT mode violated; process killed.\n"); > + dump_stack(); > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; > + local_irq_enable(); > + do_group_exit(SIGKILL); > + } > } So while I'm all for hard fails like this, can we not provide a wee bit more information in the siginfo ? And maybe use a slightly less fatal signal, such that userspace can actually catch it and dump state in debug modes? ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode @ 2015-05-12 13:20 ` Paul E. McKenney 0 siblings, 0 replies; 340+ messages in thread From: Paul E. McKenney @ 2015-05-12 13:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote: > On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote: > > +++ b/kernel/time/tick-sched.c > > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void) > > (jiffies - start)); > > dump_stack(); > > } > > + > > + /* > > + * Kill the process if it violates STRICT mode. Note that this > > + * code also results in killing the task if a kernel bug causes an > > + * irq to be delivered to this core. > > + */ > > + if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL)) > > + == PR_DATAPLANE_STRICT) { > > + pr_warn("Dataplane STRICT mode violated; process killed.\n"); > > + dump_stack(); > > + task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE; > > + local_irq_enable(); > > + do_group_exit(SIGKILL); > > + } > > } > > So while I'm all for hard fails like this, can we not provide a wee bit > more information in the siginfo ? And maybe use a slightly less fatal > signal, such that userspace can actually catch it and dump state in > debug modes? Agreed, a bit more debug state would be helpful. Thanx, Paul ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 6/6] nohz: add dataplane_debug boot flag 2015-05-08 17:58 ` Chris Metcalf ` (5 preceding siblings ...) (?) @ 2015-05-08 17:58 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-08 17:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-kernel Cc: Chris Metcalf This flag simplifies debugging of NO_HZ_FULL kernels when processes are running in PR_DATAPLANE_QUIESCE mode. Such processes should get no interrupts from the kernel; if they do, and this boot flag is specified, a kernel stack dump is generated on the console. It's possible to use ftrace to simply detect whether a dataplane core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a dataplane core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/tick.h | 2 ++ kernel/irq_work.c | 4 +++- kernel/sched/core.c | 18 ++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 1 + 8 files changed, 43 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index f6befa9855c1..5c5af5258e17 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -794,6 +794,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. dasd= [HW,NET] See header of drivers/s390/block/dasd_devmap.c.
+ dataplane_debug [KNL] + In kernels built with CONFIG_NO_HZ_FULL and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_DATAPLANE_QUIESCE. + db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port (one device per port) Format: <port#>,<type> diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..dd5ec7eca9a8 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/tick.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + tick_nohz_dataplane_debug(cpu); + } } /* diff --git a/include/linux/tick.h b/include/linux/tick.h index d191cda9b71a..4610cdf0f972 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -147,6 +147,7 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_dataplane_enter(void); +extern void tick_nohz_dataplane_debug(int cpu); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -157,6 +158,7 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_dataplane(void) { return false; } static inline void tick_nohz_dataplane_enter(void) { } +static inline void tick_nohz_dataplane_debug(int cpu) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 
cbf9fb899d92..0adc53c4e899 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + tick_nohz_dataplane_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f9123a82cbb6..202fab0c41cb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -719,6 +719,24 @@ bool sched_can_stop_tick(void) return true; } + +/* Enable debugging of any interrupts of dataplane cores. */ +static int dataplane_debug; +static int __init dataplane_debug_func(char *str) +{ + dataplane_debug = true; + return 1; +} +__setup("dataplane_debug", dataplane_debug_func); + +void tick_nohz_dataplane_debug(int cpu) +{ + if (dataplane_debug && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->dataplane_flags & PR_DATAPLANE_QUIESCE)) { + pr_err("Interrupt detected for dataplane cpu %d\n", cpu); + dump_stack(); + } +} #endif /* CONFIG_NO_HZ_FULL */ void sched_avg_update(struct rq *rq) diff --git a/kernel/signal.c b/kernel/signal.c index d51c5ddd855c..ebc552cafff5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_NO_HZ_FULL + /* If the task is being killed, don't complain about dataplane. 
*/ + if (state & TASK_WAKEKILL) + t->dataplane_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..9518fc80321b 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/tick.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + tick_nohz_dataplane_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + tick_nohz_dataplane_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index bc9406337f82..eeacabf08ca6 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -394,6 +394,7 @@ void irq_exit(void) WARN_ON_ONCE(!irqs_disabled()); #endif + tick_nohz_dataplane_debug(smp_processor_id()); account_irq_exit_time(current); preempt_count_sub(HARDIRQ_OFFSET); if (!in_interrupt() && local_softirq_pending()) -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-08 21:18 ` Andrew Morton 0 siblings, 0 replies; 340+ messages in thread From: Andrew Morton @ 2015-05-08 21:18 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > A prctl() option (PR_SET_DATAPLANE) is added Dumb question: what does the term "dataplane" mean in this context? I can't see the relationship between those words and what this patch does. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 21:18 ` Andrew Morton @ 2015-05-08 21:22 ` Steven Rostedt -1 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-05-08 21:22 UTC (permalink / raw) To: Andrew Morton Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, 8 May 2015 14:18:24 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > A prctl() option (PR_SET_DATAPLANE) is added > > Dumb question: what does the term "dataplane" mean in this context? I > can't see the relationship between those words and what this patch > does. I was thinking the same thing. I haven't gotten around to searching DATAPLANE yet. I would assume we want a name that is more meaningful for what is happening. -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-08 23:11 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-08 23:11 UTC (permalink / raw) To: Steven Rostedt, Andrew Morton Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 5/8/2015 5:22 PM, Steven Rostedt wrote: > On Fri, 8 May 2015 14:18:24 -0700 > Andrew Morton <akpm@linux-foundation.org> wrote: > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >>> A prctl() option (PR_SET_DATAPLANE) is added >> Dumb question: what does the term "dataplane" mean in this context? I >> can't see the relationship between those words and what this patch >> does. > I was thinking the same thing. I haven't gotten around to searching > DATAPLANE yet. > > I would assume we want a name that is more meaningful for what is > happening. The text in the commit message and the 0/6 cover letter do try to explain the concept. The terminology comes, I think, from networking line cards, where the "dataplane" is the part of the application that handles all the fast path processing of network packets, and the "control plane" is the part that handles routing updates, etc., generally slow-path stuff. I've probably just been using the terms so long they seem normal to me. That said, what would be clearer? NO_HZ_STRICT as a superset of NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, we're talking about no interrupts of any kind, and maybe NO_HZ is too limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look to vendors who ship bare-metal runtimes and call it BARE_METAL? Borrow the Tilera marketing name and call it ZERO_OVERHEAD? 
Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, of course :-) -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 23:11 ` Chris Metcalf @ 2015-05-08 23:19 ` Andrew Morton -1 siblings, 0 replies; 340+ messages in thread From: Andrew Morton @ 2015-05-08 23:19 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > On Fri, 8 May 2015 14:18:24 -0700 > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> > >>> A prctl() option (PR_SET_DATAPLANE) is added > >> Dumb question: what does the term "dataplane" mean in this context? I > >> can't see the relationship between those words and what this patch > >> does. > > I was thinking the same thing. I haven't gotten around to searching > > DATAPLANE yet. > > > > I would assume we want a name that is more meaningful for what is > > happening. > > The text in the commit message and the 0/6 cover letter do try to explain > the concept. The terminology comes, I think, from networking line cards, > where the "dataplane" is the part of the application that handles all the > fast path processing of network packets, and the "control plane" is the part > that handles routing updates, etc., generally slow-path stuff. I've probably > just been using the terms so long they seem normal to me. > > That said, what would be clearer? NO_HZ_STRICT as a superset of > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after all, > we're talking about no interrupts of any kind, and maybe NO_HZ is too > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > to vendors who ship bare-metal runtimes and call it BARE_METAL? 
> Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > of course :-) I like NO_INTERRUPTS. Simple, direct. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-08 23:19 ` Andrew Morton (?) @ 2015-05-09 7:05 ` Ingo Molnar 2015-05-09 7:19 ` Andy Lutomirski ` (2 more replies) -1 siblings, 3 replies; 340+ messages in thread From: Ingo Molnar @ 2015-05-09 7:05 UTC (permalink / raw) To: Andrew Morton Cc: Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > On Fri, 8 May 2015 14:18:24 -0700 > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > >> > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > >> Dumb question: what does the term "dataplane" mean in this context? I > > >> can't see the relationship between those words and what this patch > > >> does. > > > I was thinking the same thing. I haven't gotten around to searching > > > DATAPLANE yet. > > > > > > I would assume we want a name that is more meaningful for what is > > > happening. > > > > The text in the commit message and the 0/6 cover letter do try to explain > > the concept. The terminology comes, I think, from networking line cards, > > where the "dataplane" is the part of the application that handles all the > > fast path processing of network packets, and the "control plane" is the part > > that handles routing updates, etc., generally slow-path stuff. I've probably > > just been using the terms so long they seem normal to me. > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > NO_HZ_FULL? 
Or move away from the NO_HZ terminology a bit; after all, > > we're talking about no interrupts of any kind, and maybe NO_HZ is too > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > of course :-) 'baremetal' has uses in virtualization speak, so I think that would be confusing. > I like NO_INTERRUPTS. Simple, direct. NO_HZ_PURE? That's what it's really about: user-space wants to run exclusively, in pure user-mode, without any interrupts. So I don't like 'NO_HZ_NO_INTERRUPTS' for a couple of reasons: - It is similar to a term we use in perf: PERF_PMU_CAP_NO_INTERRUPT. - Another reason is that 'NO_INTERRUPTS', in most existing uses in the kernel generally relates to some sort of hardware weakness, limitation, a negative property: that we try to limp along without having a hardware interrupt and have to poll. In other driver code that uses variants of NO_INTERRUPT it appears to be similar. So I think there's some confusion potential here. - Here the fact that we don't disturb user-space is an absolutely positive property, not a limitation, a kernel feature we work hard to achieve. NO_HZ_PURE would convey that while NO_HZ_NO_INTERRUPTS wouldn't. - NO_HZ_NO_INTERRUPTS has a double negation, and it's also too long, compared to NO_HZ_FULL or NO_HZ_PURE ;-) The term 'no HZ' already expresses that we don't have periodic interruptions. We just duplicate that information with NO_HZ_NO_INTERRUPTS, while NO_HZ_FULL or NO_HZ_PURE qualifies it, makes it a stronger property - which is what we want I think. So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be such a 'zero overhead' mode of operation, where if user-space runs, it won't get interrupted in any way. 
There's no need to add yet another Kconfig variant - let's just enhance the current stuff and maybe rename it to NO_HZ_PURE to better express its intent. Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-09 7:19 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-05-09 7:19 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API, linux-kernel On Sat, May 9, 2015 at 12:05 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > >> On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >> > On 5/8/2015 5:22 PM, Steven Rostedt wrote: >> > > On Fri, 8 May 2015 14:18:24 -0700 >> > > Andrew Morton <akpm@linux-foundation.org> wrote: >> > > >> > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: >> > >> >> > >>> A prctl() option (PR_SET_DATAPLANE) is added >> > >> Dumb question: what does the term "dataplane" mean in this context? I >> > >> can't see the relationship between those words and what this patch >> > >> does. >> > > I was thinking the same thing. I haven't gotten around to searching >> > > DATAPLANE yet. >> > > >> > > I would assume we want a name that is more meaningful for what is >> > > happening. >> > >> > The text in the commit message and the 0/6 cover letter do try to explain >> > the concept. The terminology comes, I think, from networking line cards, >> > where the "dataplane" is the part of the application that handles all the >> > fast path processing of network packets, and the "control plane" is the part >> > that handles routing updates, etc., generally slow-path stuff. I've probably >> > just been using the terms so long they seem normal to me. >> > >> > That said, what would be clearer? NO_HZ_STRICT as a superset of >> > NO_HZ_FULL? 
Or move away from the NO_HZ terminology a bit; after all, >> > we're talking about no interrupts of any kind, and maybe NO_HZ is too >> > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look >> > to vendors who ship bare-metal runtimes and call it BARE_METAL? >> > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? >> > >> > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, >> > of course :-) > > 'baremetal' has uses in virtualization speak, so I think that would be > confusing. > >> I like NO_INTERRUPTS. Simple, direct. > > NO_HZ_PURE? > Naming aside, I don't think this should be a per-task flag at all. We already have way too much overhead per syscall in nohz mode, and it would be nice to get the per-syscall overhead as low as possible. We should strive, for all tasks, to keep syscall overhead down *and* avoid as many interrupts as possible. That being said, I do see a legitimate use for a way to tell the kernel "I'm going to run in userspace for a long time; stay away". But shouldn't that be a single operation, not an ongoing flag? IOW, I think that we should have a new syscall quiesce() or something rather than a prctl. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-11 19:54 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-11 19:54 UTC (permalink / raw) To: Andy Lutomirski, Ingo Molnar Cc: Andrew Morton, Steven Rostedt, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-doc, Linux API, linux-kernel (Oops, resending and forcing html off.) On 05/09/2015 03:19 AM, Andy Lutomirski wrote: > Naming aside, I don't think this should be a per-task flag at all. We > already have way too much overhead per syscall in nohz mode, and it > would be nice to get the per-syscall overhead as low as possible. We > should strive, for all tasks, to keep syscall overhead down*and* > avoid as many interrupts as possible. > > That being said, I do see a legitimate use for a way to tell the > kernel "I'm going to run in userspace for a long time; stay away". > But shouldn't that be a single operation, not an ongoing flag? IOW, I > think that we should have a new syscall quiesce() or something rather > than a prctl. Yes, if all you are concerned about is quiescing the tick, we could probably do it as a new syscall. I do note that you'd want to try to actually do the quiesce as late as possible - in particular, if you just did it in the usual syscall, you might miss out on a timer that is set by softirq, or even something that happened when you called schedule() on the syscall exit path. Doing it as late as we are doing helps to ensure that that doesn't happen. We could still arrange for this semantics by having a new quiesce() syscall set a temporary task bit that was cleared on return to userspace, but as you pointed out in a different email, that gets tricky if you end up doing multiple user_exit() calls on your way back to userspace. 
More to the point, I think it's actually important to know when an application believes it's in userspace-only mode as an actual state bit, rather than just during its transitional moment. If an application calls the kernel at an unexpected time (third-party code is the usual culprit for our customers, whether it's syscalls, page faults, or other things) we would prefer to have the "quiesce" semantics stay in force and cause the third-party code to be visibly very slow, rather than cause a totally unexpected and hard-to-diagnose interrupt show up later as we are still going around the loop that we thought was safely userspace-only. And, for debugging the kernel, it's crazy helpful to have that state bit in place: see patch 6/6 in the series for how we can diagnose things like "a different core just queued an IPI that will hit a dataplane core unexpectedly". Having that state bit makes this sort of thing a trivial check in the kernel and relatively easy to debug. Finally, I proposed a "strict" mode in patch 5/6 where we kill the process if it voluntarily enters the kernel by mistake after saying it wasn't going to any more. To do this requires a state bit, so carrying another state bit for "quiesce on user entry" seems pretty reasonable. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-11 22:15 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-05-11 22:15 UTC (permalink / raw) To: Chris Metcalf Cc: Paul E. McKenney, Frederic Weisbecker, linux-kernel, Rik van Riel, Andrew Morton, Linux API, Thomas Gleixner, Tejun Heo, Peter Zijlstra, Steven Rostedt, linux-doc, Christoph Lameter, Gilad Ben Yossef, Ingo Molnar On May 12, 2015 4:54 AM, "Chris Metcalf" <cmetcalf@ezchip.com> wrote: > > (Oops, resending and forcing html off.) > > > On 05/09/2015 03:19 AM, Andy Lutomirski wrote: >> >> Naming aside, I don't think this should be a per-task flag at all. We >> already have way too much overhead per syscall in nohz mode, and it >> would be nice to get the per-syscall overhead as low as possible. We >> should strive, for all tasks, to keep syscall overhead down*and* >> avoid as many interrupts as possible. >> >> That being said, I do see a legitimate use for a way to tell the >> kernel "I'm going to run in userspace for a long time; stay away". >> But shouldn't that be a single operation, not an ongoing flag? IOW, I >> think that we should have a new syscall quiesce() or something rather >> than a prctl. > > > Yes, if all you are concerned about is quiescing the tick, we could > probably do it as a new syscall. > > I do note that you'd want to try to actually do the quiesce as late as > possible - in particular, if you just did it in the usual syscall, you > might miss out on a timer that is set by softirq, or even something > that happened when you called schedule() on the syscall exit path. > Doing it as late as we are doing helps to ensure that that doesn't > happen. 
We could still arrange for this semantics by having a new > quiesce() syscall set a temporary task bit that was cleared on > return to userspace, but as you pointed out in a different email, > that gets tricky if you end up doing multiple user_exit() calls on > your way back to userspace. We should fix that, then. A quiesce() syscall can certainly arrange to clean up on final exit. > > More to the point, I think it's actually important to know when an > application believes it's in userspace-only mode as an actual state > bit, rather than just during its transitional moment. We can do that, too, with a new flag that's cleared on the next entry. > If an > application calls the kernel at an unexpected time (third-party code > is the usual culprit for our customers, whether it's syscalls, page > faults, or other things) we would prefer to have the "quiesce" > semantics stay in force and cause the third-party code to be > visibly very slow, rather than cause a totally unexpected and > hard-to-diagnose interrupt show up later as we are still going > around the loop that we thought was safely userspace-only. I'm not really convinced that we should design this feature around ease of debugging userspace screwups. There are already plenty of ways to do that part. Userspace getting an interrupt because userspace accidentally did a syscall is very different from userspace getting interrupted due to an IPI. > > And, for debugging the kernel, it's crazy helpful to have that state > bit in place: see patch 6/6 in the series for how we can diagnose > things like "a different core just queued an IPI that will hit a > dataplane core unexpectedly". Having that state bit makes this sort > of thing a trivial check in the kernel and relatively easy to debug. As above, this can be done with a one-time operation, too. > > Finally, I proposed a "strict" mode in patch 5/6 where we kill the > process if it voluntarily enters the kernel by mistake after saying it > wasn't going to any more. 
To do this requires a state bit, so > carrying another state bit for "quiesce on user entry" seems pretty > reasonable. I still dislike that in the form you chose. It's too deadly to be useful for anyone but the hardest RT users. I think I'd be okay with variants, though: let a suitably privileged process ask for a signal on inadvertent kernel entry or rig up an fd to be notified when one of these bad entries happens. Queueing something to a pollable fd would work, too. See that thread for more comments. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
[parent not found: <55510885.9070101@ezchip.com>]
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full [not found] ` <55510885.9070101@ezchip.com> @ 2015-05-12 13:18 ` Paul E. McKenney 0 siblings, 0 replies; 340+ messages in thread From: Paul E. McKenney @ 2015-05-12 13:18 UTC (permalink / raw) To: Chris Metcalf Cc: Andy Lutomirski, Ingo Molnar, Andrew Morton, Steven Rostedt, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, Linux API, linux-kernel On Mon, May 11, 2015 at 03:52:37PM -0400, Chris Metcalf wrote: > On 05/09/2015 03:19 AM, Andy Lutomirski wrote: > >Naming aside, I don't think this should be a per-task flag at all. We > >already have way too much overhead per syscall in nohz mode, and it > >would be nice to get the per-syscall overhead as low as possible. We > >should strive, for all tasks, to keep syscall overhead down*and* > >avoid as many interrupts as possible. > > > >That being said, I do see a legitimate use for a way to tell the > >kernel "I'm going to run in userspace for a long time; stay away". > >But shouldn't that be a single operation, not an ongoing flag? IOW, I > >think that we should have a new syscall quiesce() or something rather > >than a prctl. > > Yes, if all you are concerned about is quiescing the tick, we could > probably do it as a new syscall. > > I do note that you'd want to try to actually do the quiesce as late as > possible - in particular, if you just did it in the usual syscall, you > might miss out on a timer that is set by softirq, or even something > that happened when you called schedule() on the syscall exit path. > Doing it as late as we are doing helps to ensure that that doesn't > happen. 
We could still arrange for this semantics by having a new > quiesce() syscall set a temporary task bit that was cleared on > return to userspace, but as you pointed out in a different email, > that gets tricky if you end up doing multiple user_exit() calls on > your way back to userspace. > > More to the point, I think it's actually important to know when an > application believes it's in userspace-only mode as an actual state > bit, rather than just during its transitional moment. If an > application calls the kernel at an unexpected time (third-party code > is the usual culprit for our customers, whether it's syscalls, page > faults, or other things) we would prefer to have the "quiesce" > semantics stay in force and cause the third-party code to be > visibly very slow, rather than cause a totally unexpected and > hard-to-diagnose interrupt show up later as we are still going > around the loop that we thought was safely userspace-only. > > And, for debugging the kernel, it's crazy helpful to have that state > bit in place: see patch 6/6 in the series for how we can diagnose > things like "a different core just queued an IPI that will hit a > dataplane core unexpectedly". Having that state bit makes this sort > of thing a trivial check in the kernel and relatively easy to debug. I agree with this! It is currently a bit painful to debug problems that might result in multiple tasks runnable on a given CPU. If you suspect a problem, you enable tracing and re-run. Not particularly friendly for chasing down intermittent problems, so some sort of improvement would be a very good thing. Thanx, Paul > Finally, I proposed a "strict" mode in patch 5/6 where we kill the > process if it voluntarily enters the kernel by mistake after saying it > wasn't going to any more. To do this requires a state bit, so > carrying another state bit for "quiesce on user entry" seems pretty > reasonable. 
> > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-09 7:19 ` Mike Galbraith 0 siblings, 0 replies; 340+ messages in thread From: Mike Galbraith @ 2015-05-09 7:19 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > > On Fri, 8 May 2015 14:18:24 -0700 > > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > >> > > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > > >> Dumb question: what does the term "dataplane" mean in this context? I > > > >> can't see the relationship between those words and what this patch > > > >> does. > > > > I was thinking the same thing. I haven't gotten around to searching > > > > DATAPLANE yet. > > > > > > > > I would assume we want a name that is more meaningful for what is > > > > happening. > > > > > > The text in the commit message and the 0/6 cover letter do try to explain > > > the concept. The terminology comes, I think, from networking line cards, > > > where the "dataplane" is the part of the application that handles all the > > > fast path processing of network packets, and the "control plane" is the part > > > that handles routing updates, etc., generally slow-path stuff. I've probably > > > just been using the terms so long they seem normal to me. > > > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > > NO_HZ_FULL? 
Or move away from the NO_HZ terminology a bit; after all, > > > we're talking about no interrupts of any kind, and maybe NO_HZ is too > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > > of course :-) > > 'baremetal' has uses in virtualization speak, so I think that would be > confusing. > > > I like NO_INTERRUPTS. Simple, direct. > > NO_HZ_PURE? Hm, coke light, coke zero... OS_LIGHT and OS_ZERO? -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* RE: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-09 10:18 ` Gilad Ben Yossef 0 siblings, 0 replies; 340+ messages in thread From: Gilad Ben Yossef @ 2015-05-09 10:18 UTC (permalink / raw) To: Mike Galbraith, Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel > From: Mike Galbraith [mailto:umgwanakikbuti@gmail.com] > Sent: Saturday, May 09, 2015 10:20 AM > To: Ingo Molnar > Cc: Andrew Morton; Chris Metcalf; Steven Rostedt; Gilad Ben Yossef; Ingo > Molnar; Peter Zijlstra; Rik van Riel; Tejun Heo; Frederic Weisbecker; > Thomas Gleixner; Paul E. McKenney; Christoph Lameter; Srivatsa S. Bhat; > linux-doc@vger.kernel.org; linux-api@vger.kernel.org; linux- > kernel@vger.kernel.org > Subject: Re: [PATCH 0/6] support "dataplane" mode for nohz_full > > On Sat, 2015-05-09 at 09:05 +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > On Fri, 8 May 2015 19:11:10 -0400 Chris Metcalf <cmetcalf@ezchip.com> > wrote: > > > > > > > On 5/8/2015 5:22 PM, Steven Rostedt wrote: > > > > > On Fri, 8 May 2015 14:18:24 -0700 > > > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > > > > >> On Fri, 8 May 2015 13:58:41 -0400 Chris Metcalf > <cmetcalf@ezchip.com> wrote: > > > > >> > > > > >>> A prctl() option (PR_SET_DATAPLANE) is added > > > > >> Dumb question: what does the term "dataplane" mean in this > context? I > > > > >> can't see the relationship between those words and what this > patch > > > > >> does. > > > > > I was thinking the same thing. I haven't gotten around to > searching > > > > > DATAPLANE yet. 
> > > > > I would assume we want a name that is more meaningful for what is > > > > > happening. > > > > > > > > The text in the commit message and the 0/6 cover letter do try to > explain > > > > the concept. The terminology comes, I think, from networking line > cards, > > > > where the "dataplane" is the part of the application that handles > all the > > > > fast path processing of network packets, and the "control plane" is > the part > > > > that handles routing updates, etc., generally slow-path stuff. I've > probably > > > > just been using the terms so long they seem normal to me. > > > > > > > > That said, what would be clearer? NO_HZ_STRICT as a superset of > > > > NO_HZ_FULL? Or move away from the NO_HZ terminology a bit; after > all, > > > > we're talking about no interrupts of any kind, and maybe NO_HZ is > too > > > > limited in scope? So, NO_INTERRUPTS? USERSPACE_ONLY? Or look > > > > to vendors who ship bare-metal runtimes and call it BARE_METAL? > > > > Borrow the Tilera marketing name and call it ZERO_OVERHEAD? > > > > > > > > Maybe BARE_METAL seems most plausible -- after DATAPLANE, to me, > > > > of course :-) > > > > 'baremetal' has uses in virtualization speak, so I think that would be > > confusing. > > > > > I like NO_INTERRUPTS. Simple, direct. > > > > NO_HZ_PURE? > > Hm, coke light, coke zero... OS_LIGHT and OS_ZERO? LOL... you forgot OS_CLASSIC for backwards compatibility :-) How about TASK_SOLO? Yes, you are trying to achieve the least amount of interference, but the bigger context is about monopolizing a single CPU for yourself. Anyway, it is worth pointing out that while NO_HZ_FULL is very useful in conjunction with this, turning the tick off is also useful if you have multiple tasks runnable (e.g. if you know you only need to context switch in 100 ms, why keep a periodic interrupt running?) even though we don't support it *right now*. It might be a good idea not to entangle these concepts too much. 
Gilad Gilad Ben-Yossef Chief Software Architect EZchip Technologies Ltd. 37 Israel Pollak Ave, Kiryat Gat 82025, Israel Tel: +972-4-959-6666 ext. 576, Fax: +972-8-681-1483 Mobile: +972-52-826-0388, US Mobile: +1-973-826-0388 Email: giladb@ezchip.com, Web: http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-11 12:57 ` Steven Rostedt 0 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-05-11 12:57 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel NO_HZ_LEAVE_ME_THE_FSCK_ALONE! On Sat, 9 May 2015 09:05:38 +0200 Ingo Molnar <mingo@kernel.org> wrote: > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > such a 'zero overhead' mode of operation, where if user-space runs, it > won't get interrupted in any way. All kidding aside, I think this is the real answer. We don't need a new NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly what it was created to do. That should be fixed. Please, let's get NO_HZ_FULL up to par. That should be the main focus. -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 12:57 ` Steven Rostedt (?) @ 2015-05-11 15:36 ` Frederic Weisbecker 2015-05-11 19:19 ` Mike Galbraith -1 siblings, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-05-11 15:36 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > On Sat, 9 May 2015 09:05:38 +0200 > Ingo Molnar <mingo@kernel.org> wrote: > > > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > > such a 'zero overhead' mode of operation, where if user-space runs, it > > won't get interrupted in any way. > > > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. > > Please lets get NO_HZ_FULL up to par. That should be the main focus. Now if we can manage to make NO_HZ_FULL behave in a specific way that fits everyone's use case, I'll be happy. But some people may have hard isolation requirements (Real Time, deterministic latency) and others softer ones (HPC, only interested in performance, can live with one rare random tick, so no need to loop before returning to userspace until we have the no-noise guarantee). I expect some Real Time users may want this kind of dataplane mode where a syscall or whatever sleeps until the system is ready to provide the guarantee that no disturbance is going to happen for a given time. I'm not sure HPC users are interested in that.
In fact it goes along with the fact that NO_HZ_FULL was really only supposed to be about the tick, and now people are introducing more and more kernel default presettings that assume NO_HZ_FULL implies ISOLATION, which is about all kinds of noise (tick, tasks, irqs, ...). Which is true, but what kind of ISOLATION? Probably NO_HZ_FULL should really only be about stopping the tick; then some sort of CONFIG_ISOLATION would drive the kind of isolation we are interested in, and thereby the behaviour of NO_HZ_FULL, workqueues, timers, task affinity, irq affinity, dataplane mode, ... ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 15:36 ` Frederic Weisbecker @ 2015-05-11 19:19 ` Mike Galbraith 2015-05-11 19:25 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Mike Galbraith @ 2015-05-11 19:19 UTC (permalink / raw) To: Frederic Weisbecker Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 2015-05-11 at 17:36 +0200, Frederic Weisbecker wrote: > I expect some Real Time users may want this kind of dataplane mode where a syscall > or whatever sleeps until the system is ready to provide the guarantee that no > disturbance is going to happen for a given time. I'm not sure HPC users are interested > in that. I bet they are. RT is just a different way to spell HPC, and reverse. > In fact it goes along the fact that NO_HZ_FULL was really only supposed to be about > the tick and now people are introducing more and more kernel default presetting that > assume NO_HZ_FULL implies ISOLATION which is about all kind of noise (tick, tasks, irqs, > ...). Which is true but what kind of ISOLATION? True, nohz mode and various isolation measures are distinct properties. NO_HZ_FULL is kinda pointless without isolation measures to go with it, but you're right. I really shouldn't have acked nohz_full -> isolcpus. Beside the fact that old static isolcpus was _supposed_ to crawl off and die, I know beyond doubt that having isolated a cpu as well as you can definitely does NOT imply that said cpu should become tickless. I routinely run a load model that wants all the isolation it can get. It's not single task compute though, rt executive coordinating rt workers, and of course wants every cycle it can get, so nohz_full is less than helpful. -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 19:19 ` Mike Galbraith @ 2015-05-11 19:25 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-11 19:25 UTC (permalink / raw) To: Mike Galbraith, Frederic Weisbecker Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/11/2015 03:19 PM, Mike Galbraith wrote: > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > that old static isolcpus was_supposed_ to crawl off and die, I know > beyond doubt that having isolated a cpu as well as you can definitely > does NOT imply that said cpu should become tickless. True, at a high level, I agree that it would be better to have a top-level concept like Frederic's proposed ISOLATION that includes isolcpus and nohz_cpu (and other stuff as needed). That said, what you wrote above is wrong; even with the patch you acked, setting isolcpus does not automatically turn on nohz_full for a given cpu. The patch made it true the other way around: when you say nohz_full, you automatically get isolcpus on that cpu too. That does, at least, make sense for the semantics of nohz_full. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 19:25 ` Chris Metcalf (?) @ 2015-05-12 1:47 ` Mike Galbraith 2015-05-12 4:35 ` Mike Galbraith 2015-05-15 15:05 ` Chris Metcalf -1 siblings, 2 replies; 340+ messages in thread From: Mike Galbraith @ 2015-05-12 1:47 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: > On 05/11/2015 03:19 PM, Mike Galbraith wrote: > > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > > that old static isolcpus was_supposed_ to crawl off and die, I know > > beyond doubt that having isolated a cpu as well as you can definitely > > does NOT imply that said cpu should become tickless. > > True, at a high level, I agree that it would be better to have a > top-level concept like Frederic's proposed ISOLATION that includes > isolcpus and nohz_cpu (and other stuff as needed). > > That said, what you wrote above is wrong; even with the patch you > acked, setting isolcpus does not automatically turn on nohz_full for > a given cpu. The patch made it true the other way around: when > you say nohz_full, you automatically get isolcpus on that cpu too. > That does, at least, make sense for the semantics of nohz_full. I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. Yes, with nohz_full currently being static, the old allegedly dying but also static isolcpus scheduler off switch is a convenient thing to wire the nohz_full CPU SET (<- hint;) property to. -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-12 1:47 ` Mike Galbraith @ 2015-05-12 4:35 ` Mike Galbraith 2015-05-15 15:05 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Mike Galbraith @ 2015-05-12 4:35 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, 2015-05-12 at 03:47 +0200, Mike Galbraith wrote: > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: > > On 05/11/2015 03:19 PM, Mike Galbraith wrote: > > > I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > > > that old static isolcpus was_supposed_ to crawl off and die, I know > > > beyond doubt that having isolated a cpu as well as you can definitely > > > does NOT imply that said cpu should become tickless. > > > > True, at a high level, I agree that it would be better to have a > > top-level concept like Frederic's proposed ISOLATION that includes > > isolcpus and nohz_cpu (and other stuff as needed). > > > > That said, what you wrote above is wrong; even with the patch you > > acked, setting isolcpus does not automatically turn on nohz_full for > > a given cpu. The patch made it true the other way around: when > > you say nohz_full, you automatically get isolcpus on that cpu too. > > That does, at least, make sense for the semantics of nohz_full. > > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. > Yes, with nohz_full currently being static, the old allegedly dying but > also static isolcpus scheduler off switch is a convenient thing to wire > the nohz_full CPU SET (<- hint;) property to. BTW, another facet of this: Rik wants to make isolcpus immune to cpusets, which makes some sense, user did say isolcpus=, but that also makes isolcpus truly static. 
If the user now says nohz_full=, they lose the ability to deactivate CPU isolation, making the set fairly useless for anything other than HPC. Currently, the user can flip the isolation switch as he sees fit. He takes a size extra large performance hit for having said nohz_full=, but he doesn't lose generic utility. -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-12 1:47 ` Mike Galbraith 2015-05-12 4:35 ` Mike Galbraith @ 2015-05-15 15:05 ` Chris Metcalf 2015-05-15 18:44 ` Mike Galbraith 1 sibling, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 15:05 UTC (permalink / raw) To: Mike Galbraith Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-kernel On 05/11/2015 09:47 PM, Mike Galbraith wrote: > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: >> On 05/11/2015 03:19 PM, Mike Galbraith wrote: >>> I really shouldn't have acked nohz_full -> isolcpus. Beside the fact >>> that old static isolcpus was_supposed_ to crawl off and die, I know >>> beyond doubt that having isolated a cpu as well as you can definitely >>> does NOT imply that said cpu should become tickless. >> True, at a high level, I agree that it would be better to have a >> top-level concept like Frederic's proposed ISOLATION that includes >> isolcpus and nohz_cpu (and other stuff as needed). >> >> That said, what you wrote above is wrong; even with the patch you >> acked, setting isolcpus does not automatically turn on nohz_full for >> a given cpu. The patch made it true the other way around: when >> you say nohz_full, you automatically get isolcpus on that cpu too. >> That does, at least, make sense for the semantics of nohz_full. > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. > Yes, with nohz_full currently being static, the old allegedly dying but > also static isolcpus scheduler off switch is a convenient thing to wire > the nohz_full CPU SET (<- hint;) property to. Yes, I was responding to the bit where you said "having isolated a cpu as well as you can does NOT imply it should become tickless", but indeed, the "nohz_full -> isolcpus" patch didn't make that true. 
In any case it sounds like we were just talking past each other. > BTW, another facet of this: Rik wants to make isolcpus immune to > cpusets, which makes some sense, user did say isolcpus=, but that also > makes isolcpus truly static. If the user now says nohz_full=, they lose > the ability to deactivate CPU isolation, making the set fairly useless > for anything other than HPC. Currently, the user can flip the isolation > switch as he sees fit. He takes a size extra large performance hit for > having said nohz_full=, but he doesn't lose generic utility. I don't think I follow this completely. If the user says nohz_full=, he probably doesn't care about deactivating isolcpus later, since that defeats the entire purpose of the nohz_full= in the first place, as far as I can tell. And when you say "anything other than HPC", I'm not sure what you mean; as far as I know high-performance computing only cares because it wants that extra 0.5% of the cpu or whatever interrupts eat up, but just as a nice-to-have. The real use case is high-performance userspace drivers where the nohz_full cores are responding to real-time things like packet arrivals with almost no latency to spare. What is the generic utility you're envisioning for nohz_full cores that have turned off scheduler isolation? I assume it's some workload where you'd prefer not to have too many interrupts but still are running multiple tasks, but in that case does it really make much difference in practice? > Thomas has nuked the hrtimer softirq. Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and we will see if ksoftirqs emerge as an issue for my "cpu isolation" stuff in the future; it may be that that was the only issue. > Inlining softirqs may save a context switch, but adds cycles that we may > consume at higher frequency than the thing we're avoiding.
Yes but consuming cycles is not nearly as much of a concern as avoiding interrupts or scheduling, certainly for the case of userspace drivers that I described above. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-15 15:05 ` Chris Metcalf @ 2015-05-15 18:44 ` Mike Galbraith 2015-05-26 19:51 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Mike Galbraith @ 2015-05-15 18:44 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-kernel On Fri, 2015-05-15 at 11:05 -0400, Chris Metcalf wrote: > On 05/11/2015 09:47 PM, Mike Galbraith wrote: > > On Mon, 2015-05-11 at 15:25 -0400, Chris Metcalf wrote: > >> On 05/11/2015 03:19 PM, Mike Galbraith wrote: > >>> I really shouldn't have acked nohz_full -> isolcpus. Beside the fact > >>> that old static isolcpus was_supposed_ to crawl off and die, I know > >>> beyond doubt that having isolated a cpu as well as you can definitely > >>> does NOT imply that said cpu should become tickless. > >> True, at a high level, I agree that it would be better to have a > >> top-level concept like Frederic's proposed ISOLATION that includes > >> isolcpus and nohz_cpu (and other stuff as needed). > >> > >> That said, what you wrote above is wrong; even with the patch you > >> acked, setting isolcpus does not automatically turn on nohz_full for > >> a given cpu. The patch made it true the other way around: when > >> you say nohz_full, you automatically get isolcpus on that cpu too. > >> That does, at least, make sense for the semantics of nohz_full. > > I didn't write that, I wrote nohz_full implies (spelled '->') isolcpus. > > Yes, with nohz_full currently being static, the old allegedly dying but > > also static isolcpus scheduler off switch is a convenient thing to wire > > the nohz_full CPU SET (<- hint;) property to. 
> > Yes, I was responding to the bit where you said "having isolated a > cpu as well as you can does NOT imply it should become tickless", > but indeed, the "nohz_full -> isolcpus" patch didn't make that true. > In any case sounds like we were just talking past each other. Yup. > > BTW, another facet of this: Rik wants to make isolcpus immune to > > cpusets, which makes some sense, user did say isolcpus=, but that also > > makes isolcpus truly static. If the user now says nohz_full=, they lose > > the ability to deactivate CPU isolation, making the set fairly useless > > for anything other than HPC. Currently, the user can flip the isolation > > switch as he sees fit. He takes a size extra large performance hit for > > having said nohz_full=, but he doesn't lose generic utility. > > I don't I follow this completely. If the user says nohz_full=, he > probably doesn't care about deactivating isolcpus later, since that > defeats the entire purpose of the nohz_full= in the first place, > as far as I can tell. And when you say "anything other than HPC", > I'm not sure what you mean; as far as I know high-performance > computing only cares because it wants that extra 0.5% of the > cpu or whatever interrupts eat up, but just as a nice-to-have. > The real use case is high-performance userspace drivers where > the nohz_full cores are responding to real-time things like packet > arrivals with almost no latency to spare. Ok, verbosity on. Currently, nohz_full is static, meaning in a dynamic environment, where the user may not have a constant need for it, if you make it imply isolcpus, then make isolcpus immutable, you have just needlessly taken an option from the user. Those CPUS are no longer part of his generic resource pool, and he has nothing to say about it. > What is the generic utility you're envisioning for nohz_full cores > that have turned off scheduler isolation? 
> I assume it's some workload where you'd prefer not to have too many interrupts > but still are running multiple tasks, but in that case does it really make much > difference in practice? Again, I think we're talking past one another. I'm saying there is no need to mandate, nothing more. For your needs, my needs, whatever, that immutable may sound good, but in fact it removes flexibility, and for no good reason. This shows immediately in simple testing. Do I need nohz_full? Hell no, only for testing. If I want to test, I obviously need it for a while, and yes, I can reboot... but what's the difference between me the silly tester who needs it only to see if it works at all, and how well, and some guy who does something critical once in a while, or a company with a pool of big boxen that they reconfigure on the fly to meet whatever dynamic needs? Just because the nohz_full feature itself is currently static is no reason to put users thereof in a straitjacket by mandating that any set they define irrevocably disappears from the generic resource pool. Those CPUS are useful until the moment someone cripples them, which making nohz_full imply isolcpus does if isolcpus then also becomes immutable, which Rik's patch does. Making nohz_full imply isolcpus sounds perfectly fine until someone comes along and makes isolcpus immutable (Rik's patch), at which point the user loses a choice due to two people making it imply things that _alone_ sound perfectly fine. See what I'm saying now? > > Thomas has nuked the hrtimer softirq. > > Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and > we will see if ksoftirqs emerge as an issue for my "cpu isolation" > stuff in the future; it may be that that was the only issue. > > > Inlining softirqs may save a context switch, but adds cycles that we may > > consume at higher frequency than the thing we're avoiding.
> > Yes but consuming cycles is not nearly as much of a concern > as avoiding interrupts or scheduling, certainly for the case of > userspace drivers that I described above. If you're raising softirqs in an SMP kernel, you're also doing something that puts you at very serious risk of meeting the jitter monster, locks, and worse, sleeping locks, no? -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-15 18:44 ` Mike Galbraith @ 2015-05-26 19:51 ` Chris Metcalf 2015-05-27 3:28 ` Mike Galbraith 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-05-26 19:51 UTC (permalink / raw) To: Mike Galbraith Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-kernel Thanks for the clarification, and sorry for the slow reply; I had a busy week of meetings last week, and then the long weekend in the U.S. On 05/15/2015 02:44 PM, Mike Galbraith wrote: > Just because the nohz_full feature itself is currently static is no > reason to put users thereof in a straight jacket by mandating that any > set they define irrevocably disappears from the generic resource pool . > Those CPUS are useful until the moment someone cripples them, which > making nohz_full imply isolcpus does if isolcpus then also becomes > immutable, which Rik's patch does. Making nohz_full imply isolcpus > sounds perfectly fine until someone comes along and makes isolcpus > immutable (Rik's patch), at which point the user loses a choice due to > two people making it imply things that _alone_ sound perfectly fine. > > See what I'm saying now? That does make sense; my argument was that 99% of the time when someone specifies nohz_full they also need isolcpus. You're right that someone playing with nohz_full would be unpleasantly surprised. And of course having more flexibility always feels like a plus. On balance I suspect it's still better to make command line arguments handle the common cases most succinctly. Hopefully we'll get to a point where all of this is dynamic and how we play with the boot arguments no longer matters. If not, perhaps we revisit this and make a cpu_isolation=1-15 type command line argument that enables isolcpus and nohz_full both.
>>> Thomas has nuked the hrtimer softirq. >> Yes, this I didn't know. So I will drop my "no ksoftirqd" patch and >> we will see if ksoftirqs emerge as an issue for my "cpu isolation" >> stuff in the future; it may be that that was the only issue. >> >>> Inlining softirqs may save a context switch, but adds cycles that we may >>> consume at higher frequency than the thing we're avoiding. >> Yes but consuming cycles is not nearly as much of a concern >> as avoiding interrupts or scheduling, certainly for the case of >> userspace drivers that I described above. > If you're raising softirqs in an SMP kernel, you're also doing something > that puts you at very serious risk of meeting the jitter monster, locks, > and worse, sleeping locks, no? The softirqs were being raised by third parties for hrtimer, not by the application code itself, if I remember correctly. In any case this appears not to be an issue for nohz_full any more now. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-26 19:51 ` Chris Metcalf @ 2015-05-27 3:28 ` Mike Galbraith 0 siblings, 0 replies; 340+ messages in thread From: Mike Galbraith @ 2015-05-27 3:28 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Steven Rostedt, Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, linux-kernel On Tue, 2015-05-26 at 15:51 -0400, Chris Metcalf wrote: > On balance I suspect it's still better to make command line arguments > handle the common cases most succinctly. I prefer user specifies precisely, but yeah, that entails more typing. Idle curiosity: can SGI monster from hell boot a NO_HZ_FULL_ALL kernel, w/wo it implying isolcpus? Readers having same and a reactor to power it in their basement, please test. -Mike ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 12:57 ` Steven Rostedt (?) (?) @ 2015-05-11 17:19 ` Paul E. McKenney 2015-05-11 17:27 ` Andrew Morton -1 siblings, 1 reply; 340+ messages in thread From: Paul E. McKenney @ 2015-05-11 17:19 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! NO_HZ_OVERFLOWING? Kconfig naming controversy aside, I believe this patchset is addressing a real need. Might need additional adjustment, but something useful. Thanx, Paul > On Sat, 9 May 2015 09:05:38 +0200 > Ingo Molnar <mingo@kernel.org> wrote: > > > So I think we should either rename NO_HZ_FULL to NO_HZ_PURE, or keep > > it at NO_HZ_FULL: because the intention of NO_HZ_FULL was always to be > > such a 'zero overhead' mode of operation, where if user-space runs, it > > won't get interrupted in any way. > > > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. > > Please lets get NO_HZ_FULL up to par. That should be the main focus. > > -- Steve > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:19 ` Paul E. McKenney @ 2015-05-11 17:27 ` Andrew Morton 2015-05-11 17:33 ` Frederic Weisbecker 0 siblings, 1 reply; 340+ messages in thread From: Andrew Morton @ 2015-05-11 17:27 UTC (permalink / raw) To: paulmck Cc: Steven Rostedt, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > NO_HZ_OVERFLOWING? Actually, "NO_HZ" shouldn't appear in the name at all. The objective is to permit userspace to execute without interruption. NO_HZ is a part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical artifact from an early partial implementation. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:27 ` Andrew Morton @ 2015-05-11 17:33 ` Frederic Weisbecker 2015-05-11 18:00 ` Steven Rostedt 0 siblings, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-05-11 17:33 UTC (permalink / raw) To: Andrew Morton Cc: paulmck, Steven Rostedt, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote: > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > > NO_HZ_OVERFLOWING? > > Actually, "NO_HZ" shouldn't appear in the name at all. The objective > is to permit userspace to execute without interruption. NO_HZ is a > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical > artifact from an early partial implementation. Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 17:33 ` Frederic Weisbecker @ 2015-05-11 18:00 ` Steven Rostedt 2015-05-11 18:09 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Steven Rostedt @ 2015-05-11 18:00 UTC (permalink / raw) To: Frederic Weisbecker Cc: Andrew Morton, paulmck, Ingo Molnar, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 11 May 2015 19:33:06 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Mon, May 11, 2015 at 10:27:44AM -0700, Andrew Morton wrote: > > On Mon, 11 May 2015 10:19:16 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > > > > > > > NO_HZ_LEAVE_ME_THE_FSCK_ALONE! > > > > > > NO_HZ_OVERFLOWING? > > > > Actually, "NO_HZ" shouldn't appear in the name at all. The objective > > is to permit userspace to execute without interruption. NO_HZ is a > > part of that, as is NO_INTERRUPTS. The "NO_HZ" thing is a historical > > artifact from an early partial implementation. > > Agreed! Which is why I'd rather advocate in favour of CONFIG_ISOLATION. Then we should have CONFIG_LEAVE_ME_THE_FSCK_ALONE. Hmm, I guess that's just a synonym for CONFIG_ISOLATION. -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-11 18:00 ` Steven Rostedt @ 2015-05-11 18:09 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-11 18:09 UTC (permalink / raw) To: Steven Rostedt, Frederic Weisbecker Cc: Andrew Morton, paulmck, Ingo Molnar, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel A bunch of issues have been raised by various folks (thanks!) and I'll try to break them down and respond to them in a few different emails. This email is just about the issue of naming and whether the proposed patch series should even have its own "name" or just be part of NO_HZ_FULL. First, Ingo and Steven both suggested that this new "dataplane" mode (or whatever we want to call it; see below) should just be rolled into the existing NO_HZ_FULL and that we should focus on making that work better. Steven writes: > All kidding aside, I think this is the real answer. We don't need a new > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > what it was created to do. That should be fixed. The claim I'm making is that it's worthwhile to differentiate the two semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to avoid periodic interrupts without incurring any serious overhead". My patch series allows an app to request "kernel makes an absolute commitment to avoid all interrupts regardless of cost when leaving kernel space". These are different enough ideas, and serve different enough application needs, that I think they should be kept distinct. 
Frederic actually summed this up very nicely in his recent email when he wrote "some people may expect hard isolation requirement (Real Time, deterministic latency) and others softer isolation (HPC, only interested in performance, can live with one rare random tick, so no need to loop before returning to userspace until we have the no-noise guarantee)." So we need a way for apps to ask for the "harder" mode and let the softer mode be the default. What about naming? We may or may not want to have a Kconfig flag for this, and we may or may not have a separate mode for it, but we still will need some kind of name to talk about it with. (In particular there's the prctl name, if we take that approach, and potential boot command-line flags to consider naming for.) I'll quickly cover the suggestions that have been raised: - DATAPLANE. My suggestion, seemingly broadly disliked by folks who felt it wasn't apparent what it meant. Probably a fair point. - NO_INTERRUPTS (Andrew). Captures some of the sense, but was criticized pretty fairly by Ingo as being too negative, confusing with perf nomenclature, and too long :-) - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could use it as a name for this new mode. However, I think it's not clear enough how FULL and PURE can/should relate to each other from the names alone. - BARE_METAL (me). Ingo observes it's confusing with respect to virtualization. - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics. - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent ideas :-) - ISOLATION (Frederic). I like this but it conflicts with other uses of "isolation" in the kernel: cgroup isolation, lru page isolation, iommu isolation, scheduler isolation (at least it's a superset of that one), etc. Also, we're not exactly isolating a task - often a "dataplane" app consists of a bunch of interacting threads in userspace, so not exactly isolated. So perhaps it's too confusing. 
- OVERFLOWING (Steven) - not sure I understood this one, honestly. I suggested earlier a few other candidates that I don't love, but no one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD. One thing I'm leaning towards is to remove the intermediate state of DATAPLANE_ENABLE and say that there is really only one primary state, DATAPLANE_QUIESCE (or whatever we call it). The "dataplane but no quiesce" state probably isn't that useful, since it doesn't offer the hard guarantee that is the entire point of this patch series. So that opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the word that describes the mode; of course this sort of conflicts with RCU quiesce (though it is a superset of that so maybe that's OK). One new idea I had is to use NO_HZ_HARD to reflect what Frederic was suggesting about "soft" and "hard" requirements for NO_HZ. So enabling NO_HZ_HARD would enable my suggested QUIESCE mode. One way to focus this discussion is on the user API naming. I had prctl(PR_SET_DATAPLANE), which was attractive in being a "positive" noun. A lot of the other suggestions fail this test in various ways. Reasonable candidates seem to be: PR_SET_OS_ZERO PR_SET_TASK_SOLO PR_SET_ISOLATION Another possibility: PR_SET_NONSTOP Or take Andrew's NO_INTERRUPTS and have: PR_SET_UNINTERRUPTED I slightly favor ISOLATION at this point despite the overlap with other kernel concepts. Let the bike-shedding continue! :-) -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-11 18:36 ` Steven Rostedt 0 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-05-11 18:36 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Andrew Morton, paulmck, Ingo Molnar, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, 11 May 2015 14:09:59 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > Steven writes: > > All kidding aside, I think this is the real answer. We don't need a new > > NO_HZ, we need to make NO_HZ_FULL work. Right now it doesn't do exactly > > what it was created to do. That should be fixed. > > The claim I'm making is that it's worthwhile to differentiate the two > semantics. Plain NO_HZ_FULL just says "kernel makes a best effort to > avoid periodic interrupts without incurring any serious overhead". My > patch series allows an app to request "kernel makes an absolute > commitment to avoid all interrupts regardless of cost when leaving > kernel space". These are different enough ideas, and serve different > enough application needs, that I think they should be kept distinct. > > Frederic actually summed this up very nicely in his recent email when > he wrote "some people may expect hard isolation requirement (Real > Time, deterministic latency) and others softer isolation (HPC, only > interested in performance, can live with one rare random tick, so no > need to loop before returning to userspace until we have the no-noise > guarantee)." > > So we need a way for apps to ask for the "harder" mode and let > the softer mode be the default. Fair enough. But I would hope that this would improve on NO_HZ_FULL as well. > > What about naming? We may or may not want to have a Kconfig flag > for this, and we may or may not have a separate mode for it, but > we still will need some kind of name to talk about it with. 
(In > particular there's the prctl name, if we take that approach, and > potential boot command-line flags to consider naming for.) > > I'll quickly cover the suggestions that have been raised: > > - DATAPLANE. My suggestion, seemingly broadly disliked by folks > who felt it wasn't apparent what it meant. Probably a fair point. > > - NO_INTERRUPTS (Andrew). Captures some of the sense, but was > criticized pretty fairly by Ingo as being too negative, confusing > with perf nomenclature, and too long :-) What about NO_INTERRUPTIONS > > - PURE (Ingo). Proposed as an alternative to NO_HZ_FULL, but we could > use it as a name for this new mode. However, I think it's not clear > enough how FULL and PURE can/should relate to each other from the > names alone. I would find the two confusing as well. > > - BARE_METAL (me). Ingo observes it's confusing with respect to > virtualization. This is also confusing. > > - TASK_SOLO (Gilad). Not sure this conveys enough of the semantics. Agreed. > > - OS_LIGHT/OS_ZERO and NO_HZ_LEAVE_ME_THE_FSCK_ALONE. Excellent > ideas :-) At least the LEAVE_ME_ALONE conveys the semantics ;-) > > - ISOLATION (Frederic). I like this but it conflicts with other uses > of "isolation" in the kernel: cgroup isolation, lru page isolation, > iommu isolation, scheduler isolation (at least it's a superset of > that one), etc. Also, we're not exactly isolating a task - often > a "dataplane" app consists of a bunch of interacting threads in > userspace, so not exactly isolated. So perhaps it's too confusing. > > - OVERFLOWING (Steven) - not sure I understood this one, honestly. Actually, that was suggested by Paul McKenney. > > I suggested earlier a few other candidates that I don't love, but no > one commented on: NO_HZ_STRICT, USERSPACE_ONLY, and ZERO_OVERHEAD. > > One thing I'm leaning towards is to remove the intermediate state of > DATAPLANE_ENABLE and say that there is really only one primary state, > DATAPLANE_QUIESCE (or whatever we call it). 
The "dataplane but no > quiesce" state probably isn't that useful, since it doesn't offer the > hard guarantee that is the entire point of this patch series. So that > opens the idea of using the name NO_HZ_QUIESCE or just QUIESCE as the > word that describes the mode; of course this sort of conflicts with > RCU quiesce (though it is a superset of that so maybe that's OK). > > One new idea I had is to use NO_HZ_HARD to reflect what Frederic was > suggesting about "soft" and "hard" requirements for NO_HZ. So > enabling NO_HZ_HARD would enable my suggested QUIESCE mode. > > One way to focus this discussion is on the user API naming. I had > prctl(PR_SET_DATAPLANE), which was attractive in being a "positive" > noun. A lot of the other suggestions fail this test in various way. > Reasonable candidates seem to be: > > PR_SET_OS_ZERO > PR_SET_TASK_SOLO > PR_SET_ISOLATION > > Another possibility: > > PR_SET_NONSTOP > > Or take Andrew's NO_INTERRUPTS and have: > > PR_SET_UNINTERRUPTED For another possible answer, what about SET_TRANQUILITY A state with no disturbances. -- Steve > > I slightly favor ISOLATION at this point despite the overlap with > other kernel concepts. > > Let the bike-shedding continue! :-) > ^ permalink raw reply [flat|nested] 340+ messages in thread
* CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-11 18:09 ` Chris Metcalf (?) (?) @ 2015-05-12 9:10 ` Ingo Molnar 2015-05-12 11:48 ` Peter Zijlstra 2015-05-12 21:05 ` CONFIG_ISOLATION=y Chris Metcalf -1 siblings, 2 replies; 340+ messages in thread From: Ingo Molnar @ 2015-05-12 9:10 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Chris Metcalf <cmetcalf@ezchip.com> wrote: > - ISOLATION (Frederic). I like this but it conflicts with other uses > of "isolation" in the kernel: cgroup isolation, lru page isolation, > iommu isolation, scheduler isolation (at least it's a superset of > that one), etc. Also, we're not exactly isolating a task - often > a "dataplane" app consists of a bunch of interacting threads in > userspace, so not exactly isolated. So perhaps it's too confusing. So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is a high level kernel feature, so it won't conflict with isolation concepts in lower level subsystems such as IOMMU isolation - and other higher level features like scheduler isolation are basically another partial implementation we want to merge with all this... nohz, RCU tricks, watchdog defaults, isolcpus and various other measures to keep these CPUs and workloads as isolated as possible are (or should become) components of this high level concept. Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost zero overhead on normal workloads and on non-isolated CPUs, so that Linux distributions can enable it. Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step needed: just like cpusets, the configuration of isolated CPUs should be a completely boot option free exercise that can be dynamically done and undone by the administrator via an intuitive interface. 
Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) @ 2015-05-12 11:48 ` Peter Zijlstra 0 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 11:48 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is > a high level kernel feature, so it won't conflict with isolation > concepts in lower level subsystems such as IOMMU isolation - and other > higher level features like scheduler isolation are basically another > partial implementation we want to merge with all this... > But why do we need a CONFIG flag for something that has no content? That is, I do not see anything much; except the 'I want to stay in userspace and kill me otherwise' flag, and I'm not sure that warrants a CONFIG flag like this. Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus and whatnot. ^ permalink raw reply [flat|nested] 340+ messages in thread
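[Editorial note: the combination Peter refers to is already expressible on the kernel boot command line; a sketch of a typical setup, with CPUs 1-3 chosen purely for illustration.]

```
# Keep CPUs 1-3 tickless, exclude them from the scheduler's general
# load balancing, and offload their RCU callback processing:
nohz_full=1-3 isolcpus=1-3 rcu_nocbs=1-3
```

These three parameters (documented in Documentation/kernel-parameters.txt) are the pieces that a CONFIG_ISOLATION-style feature would tie together under one name.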
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-12 11:48 ` Peter Zijlstra (?) @ 2015-05-12 12:34 ` Ingo Molnar 2015-05-12 12:39 ` Peter Zijlstra 2015-05-12 15:36 ` Frederic Weisbecker -1 siblings, 2 replies; 340+ messages in thread From: Ingo Molnar @ 2015-05-12 12:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this > > is a high level kernel feature, so it won't conflict with > > isolation concepts in lower level subsystems such as IOMMU > > isolation - and other higher level features like scheduler > > isolation are basically another partial implementation we want to > > merge with all this... > > But why do we need a CONFIG flag for something that has no content? > > That is, I do not see anything much; except the 'I want to stay in > userspace and kill me otherwise' flag, and I'm not sure that > warrants a CONFIG flag like this. > > Other than that, its all a combination of NOHZ_FULL and > cpusets/isolcpus and whatnot. Yes, that's what I meant: CONFIG_ISOLATION would trigger what is NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as an individual Kconfig option? CONFIG_ISOLATION=y would express the guarantee from the kernel that it's possible for user-space to configure itself to run undisturbed - instead of the current inconsistent set of options and facilities. A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks, it also tries to offer various facilities and tune the defaults to turn the kernel hard-rt. Does that make sense to you? 
Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) @ 2015-05-12 12:39 ` Peter Zijlstra 0 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 12:39 UTC (permalink / raw) To: Ingo Molnar Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as > an individual Kconfig option? Ah, as a rename of nohz_full, sure that might work. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) @ 2015-05-12 12:43 ` Ingo Molnar 0 siblings, 0 replies; 340+ messages in thread From: Ingo Molnar @ 2015-05-12 12:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > > > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL > > as an individual Kconfig option? > > Ah, as a rename of nohz_full, sure that might work. It could also be named CONFIG_CPU_ISOLATION=y, to make it more explicit what it's about. Thanks, Ingo ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) 2015-05-12 12:34 ` Ingo Molnar 2015-05-12 12:39 ` Peter Zijlstra @ 2015-05-12 15:36 ` Frederic Weisbecker 1 sibling, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-05-12 15:36 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Chris Metcalf, Steven Rostedt, Andrew Morton, paulmck, Gilad Ben Yossef, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Tue, May 12, 2015 at 02:34:40PM +0200, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, May 12, 2015 at 11:10:32AM +0200, Ingo Molnar wrote: > > > > > > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this > > > is a high level kernel feature, so it won't conflict with > > > isolation concepts in lower level subsystems such as IOMMU > > > isolation - and other higher level features like scheduler > > > isolation are basically another partial implementation we want to > > > merge with all this... > > > > But why do we need a CONFIG flag for something that has no content? > > > > That is, I do not see anything much; except the 'I want to stay in > > userspace and kill me otherwise' flag, and I'm not sure that > > warrants a CONFIG flag like this. > > > > Other than that, its all a combination of NOHZ_FULL and > > cpusets/isolcpus and whatnot. > > Yes, that's what I meant: CONFIG_ISOLATION would trigger what is > NO_HZ_FULL today - we could possibly even remove CONFIG_NO_HZ_FULL as > an individual Kconfig option? Right, we could return to what we had previously: CONFIG_NO_HZ. A config that enables dynticks-idle by default and allows full dynticks if nohz_full= boot option is passed (or something driven by higher level isolation interface). Because eventually, distros enable NO_HZ_FULL so that their 0.0001% users can use it. Well at least Red Hat does. 
> > CONFIG_ISOLATION=y would express the guarantee from the kernel that > it's possible for user-space to configure itself to run undisturbed - > instead of the current inconsistent set of options and facilities. > > A bit like CONFIG_PREEMPT_RT is more than just preemptable spinlocks, > it also tries to offer various facilities and tune the defaults to > turn the kernel hard-rt. > > Does that make sense to you? Right although distros tend to want features to be enabled dynamically so that they have a single kernel to maintain. Things like PREEMPT_RT really need to be a different kernel because fundamental primitives like spinlocks must be implemented statically. But isolation can be a boot-enabled, or even runtime-enabled, as it's only about timer,irq,task affinity. Full Nohz is more complicated but it can be runtime toggled in the future. So we can bring CONFIG_CPU_ISOLATION, at least for distros that are really not interested in that so they can disable it. CONFIG_CPU_ISOLATION=y would bring an ability which is default-disabled and driven dynamically through whatever interface. ^ permalink raw reply [flat|nested] 340+ messages in thread
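Concretely, the sort of Kconfig entry being discussed might look like the sketch below. This is purely illustrative — no such entry appears in the posted series, and the option name and help text are assumptions drawn from the thread (default-inert, dynamically driven, safe for distro kernels):

```kconfig
# Hypothetical sketch only -- not part of any posted patch.
config CPU_ISOLATION
	bool "CPU isolation for low-latency userspace tasks"
	depends on NO_HZ_FULL
	help
	  Compile in the ability to shield selected CPUs from kernel
	  disturbances (ticks, IPIs, unbound workqueues) on behalf of
	  tasks that request it at runtime.  The feature stays inert
	  until enabled dynamically by the administrator, so a distro
	  kernel can say Y here without affecting normal workloads.
```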
* Re: CONFIG_ISOLATION=y 2015-05-12 9:10 ` CONFIG_ISOLATION=y (was: [PATCH 0/6] support "dataplane" mode for nohz_full) Ingo Molnar @ 2015-05-12 21:05 ` Chris Metcalf 2015-05-12 21:05 ` CONFIG_ISOLATION=y Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-12 21:05 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, Frederic Weisbecker, Andrew Morton, paulmck, Gilad Ben Yossef, Peter Zijlstra, Rik van Riel, Tejun Heo, Thomas Gleixner, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/12/2015 05:10 AM, Ingo Molnar wrote: > * Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> - ISOLATION (Frederic). I like this but it conflicts with other uses >> of "isolation" in the kernel: cgroup isolation, lru page isolation, >> iommu isolation, scheduler isolation (at least it's a superset of >> that one), etc. Also, we're not exactly isolating a task - often >> a "dataplane" app consists of a bunch of interacting threads in >> userspace, so not exactly isolated. So perhaps it's too confusing. > So I'd vote for Frederic's CONFIG_ISOLATION=y, mostly because this is > a high level kernel feature, so it won't conflict with isolation > concepts in lower level subsystems such as IOMMU isolation - and other > higher level features like scheduler isolation are basically another > partial implementation we want to merge with all this... > > nohz, RCU tricks, watchdog defaults, isolcpus and various other > measures to keep these CPUs and workloads as isolated as possible > are (or should become) components of this high level concept. > > Ideally CONFIG_ISOLATION=y would be a kernel feature that has almost > zero overhead on normal workloads and on non-isolated CPUs, so that > Linux distributions can enable it. Using CONFIG_CPU_ISOLATION to capture all this stuff instead of making CONFIG_NO_HZ_FULL do it seems plausible for naming. However, this feels like just bombing the current naming to this new name, right? 
I'd like to argue that this is orthogonal to adding new isolation functionality into no_hz_full, as my patch series has been doing. Perhaps we can defer this to a follow-up patch series? I'm happy to do the work but I'm not sure we want to bundle all that churn into the current patch series under consideration. I can use cpu_isolation_xxx for naming in the current patch series so we don't have to come back and bomb that later. > Enabling CONFIG_ISOLATION=y should be the only 'kernel config' step > needed: just like cpusets, the configuration of isolated CPUs should > be a completely boot option free excercise that can be dynamically > done and undone by the administrator via an intuitive interface. Eventually isolation can be runtime-enabled, but for now I think it makes sense to be boot-enabled. As Frederic suggested, we can arrange full nohz to be runtime toggled in the future. I agree that it should be reasonable to compile it in by default. On 05/12/2015 07:48 AM, Peter Zijlstra wrote: > But why do we need a CONFIG flag for something that has no content? > > That is, I do not see anything much; except the 'I want to stay in > userspace and kill me otherwise' flag, and I'm not sure that warrants a > CONFIG flag like this. > > Other than that, its all a combination of NOHZ_FULL and cpusets/isolcpus > and whatnot. There are three major pieces here - one is the STRICT piece that you allude to, but there is also the piece where we quiesce tasks in the kernel until no timer interrupts are pending, and the piece that allows easy debugging of stray IRQs etc to isolated cpus. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full @ 2015-05-12 10:46 ` Peter Zijlstra 0 siblings, 0 replies; 340+ messages in thread From: Peter Zijlstra @ 2015-05-12 10:46 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: > > Please lets get NO_HZ_FULL up to par. That should be the main focus. > ACK, much of this dataplane stuff is (useful) hacks working around the fact that nohz_full just isn't complete. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 0/6] support "dataplane" mode for nohz_full 2015-05-12 10:46 ` Peter Zijlstra @ 2015-05-15 15:10 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 15:10 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt Cc: Ingo Molnar, Andrew Morton, Gilad Ben Yossef, Ingo Molnar, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Srivatsa S. Bhat, linux-doc, linux-api, linux-kernel On 05/12/2015 06:46 AM, Peter Zijlstra wrote: > On Mon, May 11, 2015 at 08:57:59AM -0400, Steven Rostedt wrote: >> Please lets get NO_HZ_FULL up to par. That should be the main focus. >> > ACK, much of this dataplane stuff is (useful) hacks working around the > fact that nohz_full just isn't complete. There are enough disjoint threads on this topic that I want to just touch base here and see if you have been convinced on other threads that there is stuff beyond the hacks here: in particular 1. The basic "dataplane" mode to arrange to do extra work on return to kernel space that normally isn't warranted, to avoid future IPIs, and additionally to wait in the kernel until any timer interrupts required by the kernel invocation itself are done; and 2. The "strict" mode to allow a task to tell the kernel it isn't planning on making any more such calls, and have the kernel help diagnose any resulting application bugs. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v2 0/5] support "cpu_isolated" mode for nohz_full @ 2015-05-15 21:26 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:26 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. 
Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series (based currently on my arch/tile master tree for 4.2, in turn based on 4.1-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not yet removed the hack to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555). 
Chris Metcalf (5): nohz_full: add support for "cpu_isolated" mode nohz: support PR_CPU_ISOLATED_STRICT mode nohz: cpu_isolated strict mode configurable signal nohz: add cpu_isolated_debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Documentation/kernel-parameters.txt | 6 +++ arch/tile/kernel/ptrace.c | 6 ++- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 +++-- include/linux/sched.h | 3 ++ include/linux/tick.h | 28 +++++++++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++-- kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 6 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 98 ++++++++++++++++++++++++++++++++++++- 16 files changed, 214 insertions(+), 10 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
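For context, a cpu_isolated setup would combine the existing isolation boot options with the new debug flag added by this series, roughly as below. The composite line is hypothetical; the exact syntax of the new flag is specified in the series' kernel-parameters.txt change and is only sketched here:

```text
# Illustrative kernel command line: reserve cpus 1-3 for isolated tasks
# and enable the stray-interrupt debug reporting from this series.
linux ... isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3 cpu_isolated_debug
```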
* [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode 2015-05-15 21:26 ` Chris Metcalf @ 2015-05-15 21:27 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/sched.h | 3 +++ include/linux/tick.h | 10 +++++++++ include/uapi/linux/prctl.h | 5 +++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 ++++++++ kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 8222ae40ecb0..fb4ba400d7e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1732,6 +1732,9 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/include/linux/tick.h b/include/linux/tick.h index f8492da57ad3..ec1953474a65 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu) return cpumask_test_cpu(cpu, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline 
* [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode @ 2015-05-15 21:27 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/sched.h | 3 +++ include/linux/tick.h | 10 +++++++++ include/uapi/linux/prctl.h | 5 +++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 ++++++++ kernel/time/tick-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 8222ae40ecb0..fb4ba400d7e1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1732,6 +1732,9 @@ struct task_struct { #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ diff --git a/include/linux/tick.h b/include/linux/tick.h index f8492da57ad3..ec1953474a65 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu) return cpumask_test_cpu(cpu, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline 
void __tick_nohz_task_switch(struct task_struct *tsk) { } +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } +static inline void tick_nohz_cpu_isolated_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 72d59a1a6eb6..66739d7c1350 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. 
*/ if (state == CONTEXT_USER) { + if (tick_nohz_is_cpu_isolated()) + tick_nohz_cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index a4e372b798a5..3fd9e47f8fc8 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 914259128145..f1551c946c45 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,56 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * We normally return immediately to userspace. + * + * In "cpu_isolated" mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two "cpu_isolated" processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void tick_nohz_cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. 
*/ + lru_add_drain(); + + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + + /* Idle with interrupts enabled and wait for the tick. */ + set_current_state(TASK_INTERRUPTIBLE); + arch_cpu_idle(); + set_current_state(TASK_RUNNING); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start)); + dump_stack(); + } +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v2 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-05-15 21:27 ` Chris Metcalf @ 2015-05-15 21:27 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->cpu_isolated_flags that is set when prctl() sets the flags. We check the bit on syscall entry as well as on any exception_enter(). The prctl() syscall is ignored to allow clearing the bit again later, and exit/exit_group are ignored to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86 and tile; I am happy to try to add more for additional platforms in the final version. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/ptrace.c | 6 +++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/tick.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ 7 files changed, 76 insertions(+), 7 deletions(-) diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..d4e43a13bab1 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. */ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall( + regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index a7bc79480719..7f784054ddea 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 2821838256b4..d042f4cda39d 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/tick.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool 
context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); extern void __context_tracking_task_switch(struct task_struct *prev, @@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/tick.h b/include/linux/tick.h index ec1953474a65..b7ffb10337ba 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); +extern void tick_nohz_cpu_isolated_syscall(int nr); +extern void tick_nohz_cpu_isolated_exception(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } +static inline void tick_nohz_cpu_isolated_syscall(int nr) { } +static inline void tick_nohz_cpu_isolated_exception(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) @@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } +static inline bool tick_nohz_cpu_isolated_strict(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + 
(PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 66739d7c1350..c82509caa42e 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (__this_cpu_read(context_tracking.state) == state) { @@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state) __this_cpu_write(context_tracking.state, CONTEXT_KERNEL); } local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f1551c946c45..273820cd484a 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -27,6 +27,7 @@ #include <linux/swap.h> #include <asm/irq_regs.h> +#include <asm/unistd.h> #include "tick-internal.h" @@ -440,6 +441,43 @@ void 
tick_nohz_cpu_isolated_enter(void) } } +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void tick_nohz_cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void tick_nohz_cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v2 3/5] nohz: cpu_isolated strict mode configurable signal @ 2015-05-15 21:27 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 273820cd484a..772be78f926c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -441,11 +441,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + 
info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -464,7 +471,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -475,7 +482,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v2 4/5] nohz: add cpu_isolated_debug boot flag 2015-05-15 21:27 ` Chris Metcalf ` (2 preceding siblings ...) (?) @ 2015-05-15 21:27 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-kernel Cc: Chris Metcalf This flag simplifies debugging of NO_HZ_FULL kernels when processes are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts from the kernel, and if they do, when this boot flag is specified a kernel stack dump on the console is generated. It's possible to use ftrace to simply detect whether a cpu_isolated core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a cpu_isolated core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/tick.h | 2 ++ kernel/irq_work.c | 4 +++- kernel/sched/core.c | 18 ++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 6 ++++++ 8 files changed, 48 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index f6befa9855c1..2b4c89225d25 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. /proc/<pid>/coredump_filter. See also Documentation/filesystems/proc.txt. 
+ cpu_isolated_debug [KNL] + In kernels built with CONFIG_NO_HZ_FULL and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_CPU_ISOLATED_ENABLE. + cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..f336880e1b01 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/tick.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + tick_nohz_cpu_isolated_debug(cpu); + } } /* diff --git a/include/linux/tick.h b/include/linux/tick.h index b7ffb10337ba..0b0d76106b8c 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); extern void tick_nohz_cpu_isolated_syscall(int nr); extern void tick_nohz_cpu_isolated_exception(void); +extern void tick_nohz_cpu_isolated_debug(int cpu); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } static inline void tick_nohz_cpu_isolated_syscall(int nr) { } static inline void tick_nohz_cpu_isolated_exception(void) { } +static inline void tick_nohz_cpu_isolated_debug(int cpu) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..7f35c90346de 100644 
--- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f9123a82cbb6..7315e7272e94 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -719,6 +719,24 @@ bool sched_can_stop_tick(void) return true; } + +/* Enable debugging of any interrupts of cpu_isolated cores. */ +static int cpu_isolated_debug; +static int __init cpu_isolated_debug_func(char *str) +{ + cpu_isolated_debug = true; + return 1; +} +__setup("cpu_isolated_debug", cpu_isolated_debug_func); + +void tick_nohz_cpu_isolated_debug(int cpu) +{ + if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) { + pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu); + dump_stack(); + } +} #endif /* CONFIG_NO_HZ_FULL */ void sched_avg_update(struct rq *rq) diff --git a/kernel/signal.c b/kernel/signal.c index d51c5ddd855c..1a810ac2656e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_NO_HZ_FULL + /* If the task is being killed, don't complain about cpu_isolated. 
*/ + if (state & TASK_WAKEKILL) + t->cpu_isolated_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..6b7d8e2c8af4 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/tick.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + tick_nohz_cpu_isolated_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..333872925ff6 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,6 +24,7 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> @@ -335,6 +336,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + tick_nohz_cpu_isolated_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v2 5/5] nohz: cpu_isolated: allow tick to be fully disabled 2015-05-15 21:27 ` Chris Metcalf ` (3 preceding siblings ...) (?) @ 2015-05-15 21:27 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-15 21:27 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running completely tickless, so don't bound the time_delta for such processes. This was previously discussed in https://lkml.org/lkml/2014/10/31/364 and Thomas Gleixner observed that vruntime, load balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be indefinitely deferred just meant that no one would ever fix the underlying bugs. However it's at least true that the mode proposed in this patch can only be enabled on an isolcpus core, which may limit how important it is to maintain scheduler data correctly, for example. It's also worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2005) and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc). So this semantics is very useful if we can convince ourselves that doing this is safe. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Note: I have kept this in the series despite PeterZ's nack, since it didn't seem resolved in the original thread from v1 of the patch (https://lkml.org/lkml/2015/5/8/555). 
 kernel/time/tick-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 772be78f926c..be4db5d81ada 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -727,7 +727,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 	}
 
 #ifdef CONFIG_NO_HZ_FULL
-	if (!ts->inidle) {
+	if (!ts->inidle && !tick_nohz_is_cpu_isolated()) {
 		time_delta = min(time_delta,
 				 scheduler_tick_max_deferment());
 	}
-- 
2.1.2

^ permalink raw reply related	[flat|nested] 340+ messages in thread
* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode @ 2015-05-15 22:17 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-05-15 22:17 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On Fri, 15 May 2015, Chris Metcalf wrote: > +/* > + * We normally return immediately to userspace. > + * > + * In "cpu_isolated" mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two "cpu_isolated" processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. And why are we not preventing that situation in the first place? The scheduler should be able to figure that out easily.. > + Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void tick_nohz_cpu_isolated_enter(void) > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { What's the ACCESS_ONCE for? > + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start)); What additional value has the jiffies delta over a plain human readable '5sec' ? 
> + warned = true; > + } > + if (should_resched()) > + schedule(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; > + > + /* Idle with interrupts enabled and wait for the tick. */ > + set_current_state(TASK_INTERRUPTIBLE); > + arch_cpu_idle(); Oh NO! Not another variant of fake idle task. The idle implementations can call into code which rightfully expects that the CPU is actually IDLE. I wasted enough time already debugging the resulting wreckage. Feel free to use it for experimental purposes, but this is not going anywhere near to a mainline kernel. I completely understand WHY you want to do that, but we need proper mechanisms for that and not some duct tape engineering band aids which will create hard to debug side effects. Hint: It's a scheduler job to make sure that the machine has quiesced _BEFORE_ letting the magic task off to user land. > + set_current_state(TASK_RUNNING); > + } > + if (warned) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start)); > + dump_stack(); And that dump_stack() tells us which important information? tick_nohz_cpu_isolated_enter context_tracking_enter context_tracking_user_enter arch_return_to_user_code Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode 2015-05-15 22:17 ` Thomas Gleixner @ 2015-05-28 20:38 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-05-28 20:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Thomas, thanks for the feedback. My reply was delayed by being in meetings all last week and then catching up this week - sorry about that. On 05/15/2015 06:17 PM, Thomas Gleixner wrote: > On Fri, 15 May 2015, Chris Metcalf wrote: >> +/* >> + * We normally return immediately to userspace. >> + * >> + * In "cpu_isolated" mode we wait until no more interrupts are >> + * pending. Otherwise we nap with interrupts enabled and wait for the >> + * next interrupt to fire, then loop back and retry. >> + * >> + * Note that if you schedule two "cpu_isolated" processes on the same >> + * core, neither will ever leave the kernel, and one will have to be >> + * killed manually. > And why are we not preventing that situation in the first place? The > scheduler should be able to figure that out easily.. This is an interesting observation. My instinct is that adding tests in the scheduler costs time on a hot path for all processes, and I'm trying to avoid adding cost where we don't need it. It's pretty much a straight-up application bug if two threads or processes explicitly request the cpu_isolated semantics, and then explicitly schedule themselves onto the same core, so my preference was to let the application writer identify and fix the problem if it comes up. However, I'm certainly open to thinking about checking for this failure mode in the scheduler, though I don't know enough about the scheduler to immediately identify where such a change might go. 
Would it be appropriate to think about this as a follow-on patch, if it's determined that the cost of testing for this condition is worth it? >> + Otherwise in situations where another process is >> + * in the runqueue on this cpu, this task will just wait for that >> + * other task to go idle before returning to user space. >> + */ >> +void tick_nohz_cpu_isolated_enter(void) >> +{ >> + struct clock_event_device *dev = >> + __this_cpu_read(tick_cpu_device.evtdev); >> + struct task_struct *task = current; >> + unsigned long start = jiffies; >> + bool warned = false; >> + >> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ >> + lru_add_drain(); >> + >> + while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) { > What's the ACCESS_ONCE for? We are technically in a loop here where we are waiting for an interrupt handler to update dev->next_event.tv64, so I felt it was appropriate to flag it as such. If we didn't have function calls inside the loop, the compiler would eliminate the loop. But it's just a style thing, and we can certainly drop it if it seems confusing. In any case I've changed it to READ_ONCE() since that's preferred now anyway; this code was originally written a while ago. >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start)); > What additional value has the jiffies delta over a plain human > readable '5sec' ? Good point. I've changed it to emit a value in seconds. >> + warned = true; >> + } >> + if (should_resched()) >> + schedule(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; >> + >> + /* Idle with interrupts enabled and wait for the tick. */ >> + set_current_state(TASK_INTERRUPTIBLE); >> + arch_cpu_idle(); > Oh NO! Not another variant of fake idle task. The idle implementations > can call into code which rightfully expects that the CPU is actually > IDLE. 
> > I wasted enough time already debugging the resulting wreckage. Feel > free to use it for experimental purposes, but this is not going > anywhere near to a mainline kernel. > > I completely understand WHY you want to do that, but we need proper > mechanisms for that and not some duct tape engineering band aids which > will create hard to debug side effects. Yes, I worried about that a little when I put it in. In particular it's certainly true that arch_cpu_idle() isn't necessarily designed to behave properly in this context, even if it may do the right thing somewhat by accident. In fact, we don't need the cpu-idling semantics in this loop; the loop can spin quite happily waiting for next_event in the tick_cpu_device to stop being defined (or a signal or scheduling request to occur). I've changed the code to make it opt-in, so that a weak no-op function that just calls cpu_relax() can be replaced by an architecture-defined function that safely waits until an interrupt is delivered, reducing the number of times we spin around in the outer loop. > Hint: It's a scheduler job to make sure that the machine has quiesced > _BEFORE_ letting the magic task off to user land. This is not so clear to me. There may, for example, be RCU events that occur after the scheduler is done with its part, that still require another timer tick on the cpu to finish quiescing RCU. I think we need to check for the timer-quiesced state as late as possible to handle things like this. Arguably the scheduler could also try to do the right thing with a cpu_isolated task, but again, this feels like time spent in the scheduler hot path that affects the non-cpu_isolated tasks. For cpu_isolated tasks they should be the only thing that's runnable on the core 99.999% of the time, or you've done something quite wrong anyway. 
>> + set_current_state(TASK_RUNNING); >> + } >> + if (warned) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start)); >> + dump_stack(); > And that dump_stack() tells us which important information? > > tick_nohz_cpu_isolated_enter > context_tracking_enter > context_tracking_user_enter > arch_return_to_user_code For tile, the dump_stack() includes the register state, which includes the interrupt type that took us into the kernel, which might be helpful. That said, I'm certainly willing to remove it, or make it call a weak no-op function where architectures can add more info if they have it. Thanks again! I'll put out v3 of the patch series shortly, with changes from your comments incorporated. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full @ 2015-06-03 15:29 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. 
Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series (based currently on my arch/tile master tree for 4.2, in turn based on 4.1-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace since a tick will always be pending. 
Chris Metcalf (5): nohz_full: add support for "cpu_isolated" mode nohz: support PR_CPU_ISOLATED_STRICT mode nohz: cpu_isolated strict mode configurable signal nohz: add cpu_isolated_debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Documentation/kernel-parameters.txt | 6 +++ arch/tile/kernel/process.c | 9 ++++ arch/tile/kernel/ptrace.c | 6 ++- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 ++-- include/linux/sched.h | 3 ++ include/linux/tick.h | 28 ++++++++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++-- kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 6 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++- 17 files changed, 229 insertions(+), 10 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v3 0/5] support "cpu_isolated" mode for nohz_full
From: Chris Metcalf
Date: 2015-06-03 15:29 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
  Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
  Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
  linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf

The existing nohz_full mode does a nice job of suppressing extraneous
kernel interrupts for cores that desire it.  However, there is a need
for a more deterministic mode that rigorously disallows kernel
interrupts, even at a higher cost in user/kernel transition time: for
example, high-speed networking applications running userspace drivers
that will drop packets if they are ever interrupted.

These changes attempt to provide an initial draft of such a framework;
they add no overhead to the usual non-nohz_full mode, and only very
small overhead to the typical nohz_full mode.  A prctl() option
(PR_SET_CPU_ISOLATED) is added to control whether processes have
requested these stricter semantics, and within that prctl() option we
provide a number of different bits for more precise control.
Additionally, we add a new command-line boot argument to facilitate
debugging where unexpected interrupts are being delivered from.

Code that is conceptually similar has been in use in Tilera's
Multicore Development Environment since 2008, known as Zero-Overhead
Linux, and has seen wide adoption by a range of customers.  This patch
series represents the first serious attempt to upstream that
functionality.

Although the current state of the kernel isn't quite ready to run with
absolutely no kernel interrupts (for example, workqueues on
cpu_isolated cores still remain to be dealt with), this patch series
provides a way to make dynamic tradeoffs between avoiding kernel
interrupts on the one hand, and making voluntary calls in and out of
the kernel more expensive, for tasks that want it.

The series (based currently on my arch/tile master tree for 4.2, in
turn based on 4.1-rc1) is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

v3:
- remove dependency on cpu_idle subsystem (Thomas Gleixner)
- use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter
- use seconds for console messages instead of jiffies (Thomas Gleixner)
- updated commit description for patch 5/5

v2:
- rename "dataplane" to "cpu_isolated"
- drop ksoftirqd suppression changes (believed no longer needed)
- merge previous "QUIESCE" functionality into baseline functionality
- explicitly track syscalls and exceptions for "STRICT" functionality
- allow configuring a signal to be delivered for STRICT mode failures
- move debug tracking to irq_enter(), not irq_exit()

Note: I have not removed the commit to disable the 1Hz timer tick
fallback that was nack'ed by PeterZ, pending a decision on that thread
as to what to do (https://lkml.org/lkml/2015/5/8/555); also, if we do
not remove the 1Hz tick, cpu_isolated threads will never re-enter
userspace, since a tick will always be pending.

Chris Metcalf (5):
  nohz_full: add support for "cpu_isolated" mode
  nohz: support PR_CPU_ISOLATED_STRICT mode
  nohz: cpu_isolated strict mode configurable signal
  nohz: add cpu_isolated_debug boot flag
  nohz: cpu_isolated: allow tick to be fully disabled

 Documentation/kernel-parameters.txt |   6 +++
 arch/tile/kernel/process.c          |   9 ++++
 arch/tile/kernel/ptrace.c           |   6 ++-
 arch/tile/mm/homecache.c            |   5 +-
 arch/x86/kernel/ptrace.c            |   2 +
 include/linux/context_tracking.h    |  11 ++--
 include/linux/sched.h               |   3 ++
 include/linux/tick.h                |  28 ++++++++++
 include/uapi/linux/prctl.h          |   8 +++
 kernel/context_tracking.c           |  12 +++--
 kernel/irq_work.c                   |   4 +-
 kernel/sched/core.c                 |  18 +++++++
 kernel/signal.c                     |   5 ++
 kernel/smp.c                        |   4 ++
 kernel/softirq.c                    |   6 +++
 kernel/sys.c                        |   8 +++
 kernel/time/tick-sched.c            | 104 +++++++++++++++++++++++++++++++++++-
 17 files changed, 229 insertions(+), 10 deletions(-)

-- 
2.1.2
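[Editorial sketch: a minimal userspace usage example for the prctl()
interface proposed in this cover letter.  The PR_*_CPU_ISOLATED values
come from the series' change to include/uapi/linux/prctl.h and are NOT
in mainline headers, so they are defined locally; on a kernel without
the series applied, the prctl() call will simply fail.]

```c
/*
 * Hypothetical usage sketch for the proposed cpu_isolated prctl().
 * The PR_*_CPU_ISOLATED constants below are taken from this patch
 * series (not mainline), so we define them here ourselves.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/prctl.h>

#ifndef PR_SET_CPU_ISOLATED
#define PR_SET_CPU_ISOLATED	47
#define PR_GET_CPU_ISOLATED	48
#define PR_CPU_ISOLATED_ENABLE	(1 << 0)
#endif

/* Pin the calling thread to one cpu (ideally a nohz_full core). */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Request cpu_isolated semantics for the calling task; returns 0 on
 * success, -1 (with errno set) on a kernel without this series.
 */
static int enable_cpu_isolated(void)
{
	return prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE, 0, 0, 0);
}
```

A real application would pin itself to a nohz_full core, call
enable_cpu_isolated(), and then enter its main polling loop; each
subsequent return to userspace triggers the quiescing described in
patch 1/5.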
* [PATCH v3 1/5] nohz_full: add support for "cpu_isolated" mode
From: Chris Metcalf
Date: 2015-06-03 15:29 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
  Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
  Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
  linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the kernel
entry/exit path, to provide 100% kernel semantics, etc.  However, some
applications require a stronger commitment from the kernel to avoid
interruptions, in particular userspace device driver style
applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect to
have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl().  When the
_ENABLE bit is set for a task, and it is returning to userspace on a
nohz_full core, it calls the new tick_nohz_cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only two actions taken.  First, the task calls
lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core.  Then, the code checks for
pending timer interrupts and quiesces until they are no longer
pending.  As a result, sys calls (and page faults, etc.) can be
inordinately slow.  However, this quiescing guarantees that no
unexpected interrupts will occur, even if the application
intentionally calls into the kernel.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/process.c |  9 ++++++++
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++
 include/uapi/linux/prctl.h |  5 ++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 +++++++
 kernel/time/tick-sched.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..e20c3f4a6a82 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait()
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8222ae40ecb0..fb4ba400d7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1732,6 +1732,9 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long task_state_change;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int cpu_isolated_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..ec1953474a65 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
+#include <linux/prctl.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -134,11 +135,18 @@ static inline bool tick_nohz_full_cpu(int cpu)
 	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -147,6 +155,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 72d59a1a6eb6..66739d7c1350 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/tick.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -85,6 +86,8 @@ void context_tracking_enter(enum ctx_state state)
 	 * on the tick.
 	 */
 	if (state == CONTEXT_USER) {
+		if (tick_nohz_is_cpu_isolated())
+			tick_nohz_cpu_isolated_enter();
 		trace_user_enter(0);
 		vtime_user_enter(current);
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b798a5..3fd9e47f8fc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2243,6 +2243,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..f6236b66788f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include <linux/posix-timers.h>
 #include <linux/perf_event.h>
 #include <linux/context_tracking.h>
+#include <linux/swap.h>
 
 #include <asm/irq_regs.h>
 
@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait()
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		tick_nohz_cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
+
 #endif
 
 /*
-- 
2.1.2
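[Editorial sketch: the quiescing loop in tick_nohz_cpu_isolated_enter()
can be modeled in a few lines of plain C.  This is a simplified
userspace model, not the kernel code: `next_event` and
`wait_for_interrupt` are hypothetical stand-ins for the clock-event
lookup and tick_nohz_cpu_isolated_wait(), and INT64_MAX plays the role
of KTIME_MAX ("no timer pending").]

```c
#include <stdint.h>

#define MODEL_KTIME_MAX INT64_MAX	/* "no timer event pending" */

/*
 * Simplified model of the quiescing loop: keep waiting while any
 * timer event is pending, and only return (to "userspace") once the
 * cpu is quiet.  Returns how many waits were needed.
 */
static int quiesce(int64_t (*next_event)(void),
		   void (*wait_for_interrupt)(void))
{
	int waits = 0;

	while (next_event() != MODEL_KTIME_MAX) {
		wait_for_interrupt();	/* models tick_nohz_cpu_isolated_wait() */
		waits++;
	}
	return waits;
}

/* A toy "timer" that stays pending for the first three polls. */
static int polls;
static int64_t fake_next_event(void)
{
	return polls < 3 ? 1000 : MODEL_KTIME_MAX;
}
static void fake_wait(void)
{
	polls++;
}
```

The point of the real loop is the same as this model's: the cost of
waiting is paid inside the kernel, on the way out, so that once the
task is back in userspace no timer interrupt remains to disturb it.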
* [PATCH v3 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode
From: Chris Metcalf
Date: 2015-06-03 15:29 UTC
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
  Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
  Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
  linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf

With cpu_isolated mode, the task is in principle guaranteed not to be
interrupted by the kernel, but only if it behaves.  In particular, if
it enters the kernel via system call, page fault, or any of a number
of other synchronous traps, it may be unexpectedly exposed to long
latencies.  Add a simple flag that puts the process into a state where
any such kernel entry is fatal.

To allow the state to be entered and exited, we add an internal bit to
current->cpu_isolated_flags that is set when prctl() sets the flags.
We check the bit on syscall entry as well as on any
exception_enter().  The prctl() syscall is ignored to allow clearing
the bit again later, and exit/exit_group are ignored to allow exiting
the task without a pointless signal killing you as you try to do so.

This change adds the syscall-detection hooks only for x86 and tile; I
am happy to try to add more for additional platforms in the final
version.

The signature of context_tracking_exit() changes to report whether we,
in fact, are exiting back to user space, so that we can track user
exceptions properly separately from other kernel entries.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
 arch/tile/kernel/ptrace.c        |  6 +++++-
 arch/x86/kernel/ptrace.c         |  2 ++
 include/linux/context_tracking.h | 11 ++++++++---
 include/linux/tick.h             | 16 ++++++++++++++++
 include/uapi/linux/prctl.h       |  1 +
 kernel/context_tracking.c        |  9 ++++++---
 kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index f84eed8243da..d4e43a13bab1 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs)
 	 * If TIF_NOHZ is set, we are required to call user_exit() before
 	 * doing anything that could touch RCU.
 	 */
-	if (work & _TIF_NOHZ)
+	if (work & _TIF_NOHZ) {
 		user_exit();
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(
+				regs->regs[TREG_SYSCALL_NR]);
+	}
 
 	if (work & _TIF_SYSCALL_TRACE) {
 		if (tracehook_report_syscall_entry(regs))
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..7f784054ddea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	if (work & _TIF_NOHZ) {
 		user_exit();
 		work &= ~_TIF_NOHZ;
+		if (tick_nohz_cpu_isolated_strict())
+			tick_nohz_cpu_isolated_syscall(regs->orig_ax);
 	}
 
 #ifdef CONFIG_SECCOMP
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 2821838256b4..d042f4cda39d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -3,6 +3,7 @@
 
 #include <linux/sched.h>
 #include <linux/vtime.h>
+#include <linux/tick.h>
 #include <linux/context_tracking_state.h>
 #include <asm/ptrace.h>
 
@@ -11,7 +12,7 @@
 extern void context_tracking_cpu_set(int cpu);
 
 extern void context_tracking_enter(enum ctx_state state);
-extern void context_tracking_exit(enum ctx_state state);
+extern bool context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,8 +38,12 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 
 	prev_ctx = this_cpu_read(context_tracking.state);
-	if (prev_ctx != CONTEXT_KERNEL)
-		context_tracking_exit(prev_ctx);
+	if (prev_ctx != CONTEXT_KERNEL) {
+		if (context_tracking_exit(prev_ctx)) {
+			if (tick_nohz_cpu_isolated_strict())
+				tick_nohz_cpu_isolated_exception();
+		}
+	}
 
 	return prev_ctx;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ec1953474a65..b7ffb10337ba 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -147,6 +147,8 @@ extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
 extern void tick_nohz_cpu_isolated_enter(void);
+extern void tick_nohz_cpu_isolated_syscall(int nr);
+extern void tick_nohz_cpu_isolated_exception(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -157,6 +159,8 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
 static inline void tick_nohz_cpu_isolated_enter(void) { }
+static inline void tick_nohz_cpu_isolated_syscall(int nr) { }
+static inline void tick_nohz_cpu_isolated_exception(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
@@ -189,4 +193,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk)
 		__tick_nohz_task_switch(tsk);
 }
 
+static inline bool tick_nohz_cpu_isolated_strict(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (current->cpu_isolated_flags &
+	     (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) ==
+	    (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT))
+		return true;
+#endif
+	return false;
+}
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index edb40b6b84db..0c11238a84fb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -194,5 +194,6 @@ struct prctl_mm_map {
 #define PR_SET_CPU_ISOLATED	47
 #define PR_GET_CPU_ISOLATED	48
 # define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+# define PR_CPU_ISOLATED_STRICT	(1 << 1)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 66739d7c1350..c82509caa42e 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -131,15 +131,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+bool context_tracking_exit(enum ctx_state state)
 {
 	unsigned long flags;
+	bool from_user = false;
 
 	if (!context_tracking_is_enabled())
-		return;
+		return false;
 
 	if (in_interrupt())
-		return;
+		return false;
 
 	local_irq_save(flags);
 	if (__this_cpu_read(context_tracking.state) == state) {
@@ -150,6 +151,7 @@ void context_tracking_exit(enum ctx_state state)
 		 */
 		rcu_user_exit();
 		if (state == CONTEXT_USER) {
+			from_user = true;
 			vtime_user_exit(current);
 			trace_user_exit(0);
 		}
@@ -157,6 +159,7 @@ void context_tracking_exit(enum ctx_state state)
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
 	}
 	local_irq_restore(flags);
+	return from_user;
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
 EXPORT_SYMBOL_GPL(context_tracking_exit);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f6236b66788f..ce3bcf29a0f6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -27,6 +27,7 @@
 #include <linux/swap.h>
 
 #include <asm/irq_regs.h>
+#include <asm/unistd.h>
 
 #include "tick-internal.h"
 
@@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void)
 	}
 }
 
+static void kill_cpu_isolated_strict_task(void)
+{
+	dump_stack();
+	current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE;
+	send_sig(SIGKILL, current, 1);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in) if the _STRICT flag is set.
+ */
+void tick_nohz_cpu_isolated_syscall(int syscall)
+{
+	/* Ignore prctl() syscalls or any task exit. */
+	switch (syscall) {
+	case __NR_prctl:
+	case __NR_exit:
+	case __NR_exit_group:
+		return;
+	}
+
+	pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n",
+		current->comm, current->pid, syscall);
+	kill_cpu_isolated_strict_task();
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void tick_nohz_cpu_isolated_exception(void)
+{
+	pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n",
+		current->comm, current->pid);
+	kill_cpu_isolated_strict_task();
+}
+
 #endif
 
 /*
-- 
2.1.2
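[Editorial sketch: the strict-mode screening in
tick_nohz_cpu_isolated_syscall() reduces to a whitelist check, which
can be modeled in userspace C.  The NR_* values below are the x86_64
syscall numbers used here only for illustration; the kernel code uses
the architecture's own __NR_* constants from <asm/unistd.h>.]

```c
/*
 * Model of strict-mode syscall screening: prctl() and the exit paths
 * are always permitted (so a task can turn the mode off, or die
 * cleanly); any other syscall counts as a violation, which the real
 * code answers with a pr_warn(), dump_stack(), and SIGKILL.
 */
enum {
	NR_EXIT       = 60,	/* x86_64 __NR_exit */
	NR_PRCTL      = 157,	/* x86_64 __NR_prctl */
	NR_EXIT_GROUP = 231,	/* x86_64 __NR_exit_group */
};

static int strict_syscall_is_violation(int nr)
{
	switch (nr) {
	case NR_PRCTL:
	case NR_EXIT:
	case NR_EXIT_GROUP:
		return 0;	/* permitted even in strict mode */
	}
	return 1;		/* would SIGKILL the task */
}
```

Note the asymmetry this models: the whitelist is consulted only on the
deliberate syscall path; page faults and other synchronous traps go
through tick_nohz_cpu_isolated_exception() and are always violations.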
* [PATCH v3 3/5] nohz: cpu_isolated strict mode configurable signal @ 2015-06-03 15:29 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index ce3bcf29a0f6..f09c003da22f 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + 
info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v3 4/5] nohz: add cpu_isolated_debug boot flag 2015-06-03 15:29 ` Chris Metcalf ` (3 preceding siblings ...) (?) @ 2015-06-03 15:29 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-kernel Cc: Chris Metcalf This flag simplifies debugging of NO_HZ_FULL kernels when processes are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts from the kernel, and if they do, when this boot flag is specified a kernel stack dump on the console is generated. It's possible to use ftrace to simply detect whether a cpu_isolated core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a cpu_isolated core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/tick.h | 2 ++ kernel/irq_work.c | 4 +++- kernel/sched/core.c | 18 ++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 6 ++++++ 8 files changed, 48 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index f6befa9855c1..2b4c89225d25 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -743,6 +743,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. /proc/<pid>/coredump_filter. See also Documentation/filesystems/proc.txt. 
+ cpu_isolated_debug [KNL] + In kernels built with CONFIG_NO_HZ_FULL and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_CPU_ISOLATED_ENABLE. + cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..f336880e1b01 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/tick.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + tick_nohz_cpu_isolated_debug(cpu); + } } /* diff --git a/include/linux/tick.h b/include/linux/tick.h index b7ffb10337ba..0b0d76106b8c 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -149,6 +149,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); extern void tick_nohz_cpu_isolated_syscall(int nr); extern void tick_nohz_cpu_isolated_exception(void); +extern void tick_nohz_cpu_isolated_debug(int cpu); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -161,6 +162,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } static inline void tick_nohz_cpu_isolated_syscall(int nr) { } static inline void tick_nohz_cpu_isolated_exception(void) { } +static inline void tick_nohz_cpu_isolated_debug(int cpu) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..7f35c90346de 100644 
--- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f9123a82cbb6..7315e7272e94 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -719,6 +719,24 @@ bool sched_can_stop_tick(void) return true; } + +/* Enable debugging of any interrupts of cpu_isolated cores. */ +static int cpu_isolated_debug; +static int __init cpu_isolated_debug_func(char *str) +{ + cpu_isolated_debug = true; + return 1; +} +__setup("cpu_isolated_debug", cpu_isolated_debug_func); + +void tick_nohz_cpu_isolated_debug(int cpu) +{ + if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) { + pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu); + dump_stack(); + } +} #endif /* CONFIG_NO_HZ_FULL */ void sched_avg_update(struct rq *rq) diff --git a/kernel/signal.c b/kernel/signal.c index d51c5ddd855c..1a810ac2656e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -689,6 +689,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_NO_HZ_FULL + /* If the task is being killed, don't complain about cpu_isolated. 
*/ + if (state & TASK_WAKEKILL) + t->cpu_isolated_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..6b7d8e2c8af4 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/tick.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + tick_nohz_cpu_isolated_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..333872925ff6 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,6 +24,7 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> @@ -335,6 +336,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + tick_nohz_cpu_isolated_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v3 5/5] nohz: cpu_isolated: allow tick to be fully disabled 2015-06-03 15:29 ` Chris Metcalf ` (4 preceding siblings ...) (?) @ 2015-06-03 15:29 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-06-03 15:29 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running completely tickless, so don't bound the time_delta for such processes. In addition, due to the way such processes quiesce by waiting for the timer tick to stop prior to returning to userspace, without this commit it won't be possible to use the cpu_isolated mode at all. Removing the 1-second cap was previously discussed (see link below) and Thomas Gleixner observed that vruntime, load balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be indefinitely deferred just meant that no one would ever fix the underlying bugs. However, it's at least true that the mode proposed in this patch can only be enabled on an isolcpus core by a process requesting cpu_isolated mode, which may limit how important it is to maintain scheduler data correctly, for example. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, this will create an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently, so it may also be worth it from that perspective.
Finally, it's worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2008) and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc). So these semantics are very useful if we can convince ourselves that doing this is safe. Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/time/tick-sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index f09c003da22f..ec36ed00af9d 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -733,7 +733,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, } #ifdef CONFIG_NO_HZ_FULL - if (!ts->inidle) { + if (!ts->inidle && !tick_nohz_is_cpu_isolated()) { time_delta = min(time_delta, scheduler_tick_max_deferment()); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v4 0/5] support "cpu_isolated" mode for nohz_full 2015-06-03 15:29 ` Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This posting of the series is basically a "ping" since there were no comments on the v3 version. I have rebased it to 4.2-rc1, added support for arm64 syscall tracking for "strict" mode, and retested it; are there any remaining concerns? Thomas, I haven't heard from you whether my removal of the cpu_idle calls sufficiently addresses your concerns about that aspect. Are there other concerns with this patch series at this point? Original patch series cover letter follows: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from.
Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series (based currently on v4.2-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace since a tick will always be pending. 
Chris Metcalf (5): nohz_full: add support for "cpu_isolated" mode nohz: support PR_CPU_ISOLATED_STRICT mode nohz: cpu_isolated strict mode configurable signal nohz: add cpu_isolated_debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Documentation/kernel-parameters.txt | 6 +++ arch/tile/kernel/process.c | 9 ++++ arch/tile/kernel/ptrace.c | 6 ++- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 ++-- include/linux/sched.h | 3 ++ include/linux/tick.h | 28 ++++++++++ include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++-- kernel/irq_work.c | 4 +- kernel/sched/core.c | 18 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 6 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 104 +++++++++++++++++++++++++++++++++++- 17 files changed, 229 insertions(+), 10 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. 
* [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode @ 2015-07-13 19:57 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a stronger commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the stronger semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The "cpu_isolated" state is indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only two actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++++ include/linux/sched.h | 3 +++ include/linux/tick.h | 10 ++++++++ include/uapi/linux/prctl.h | 5 ++++ kernel/context_tracking.c | 3 +++ kernel/sys.c | 8 +++++++ kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 95 insertions(+) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..3625e839ad62 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_NO_HZ_FULL +void tick_nohz_cpu_isolated_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/sched.h b/include/linux/sched.h index ae21f1591615..f350b0c20bbc 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1778,6 +1778,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_NO_HZ_FULL + unsigned int cpu_isolated_flags; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ diff --git a/include/linux/tick.h b/include/linux/tick.h index 3741ba1a652c..cb5569181359 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -10,6 +10,7 @@ #include <linux/context_tracking_state.h> #include <linux/cpumask.h> #include <linux/sched.h> +#include <linux/prctl.h> #ifdef CONFIG_GENERIC_CLOCKEVENTS extern void __init tick_init(void); @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) cpumask_or(mask, mask, tick_nohz_full_mask); } +static inline bool tick_nohz_is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + extern void __tick_nohz_full_check(void); extern void tick_nohz_full_kick(void); extern void tick_nohz_full_kick_cpu(int cpu); extern void tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); +extern void tick_nohz_cpu_isolated_enter(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } static inline void tick_nohz_full_kick(void) { } static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } +static inline void tick_nohz_cpu_isolated_enter(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. 
*/ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..f9de3ee12723 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/tick.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (tick_nohz_is_cpu_isolated()) + tick_nohz_cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..36eb9a839f1f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NO_HZ_FULL + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c792429e98c6..4cf093c012d1 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/swap.h> #include <asm/irq_regs.h> @@ -389,6 +390,62 @@ void __init tick_nohz_init(void) pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", cpumask_pr_args(tick_nohz_full_mask)); } + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. 
+ */ +void __weak tick_nohz_cpu_isolated_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In "cpu_isolated" mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two "cpu_isolated" processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void tick_nohz_cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + tick_nohz_cpu_isolated_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` Chris Metcalf (?) @ 2015-07-13 20:40 ` Andy Lutomirski 2015-07-13 21:01 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-07-13 20:40 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > The existing nohz_full mode makes tradeoffs to minimize userspace > interruptions while still attempting to avoid overheads in the > kernel entry/exit path, to provide 100% kernel semantics, etc. > > However, some applications require a stronger commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications to elect > to have the stronger semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. I thought the general consensus was that this should be the default behavior and that any associated bugs should be fixed. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 20:40 ` Andy Lutomirski @ 2015-07-13 21:01 ` Chris Metcalf 2015-07-13 21:45 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 21:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On 07/13/2015 04:40 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> The existing nohz_full mode makes tradeoffs to minimize userspace >> interruptions while still attempting to avoid overheads in the >> kernel entry/exit path, to provide 100% kernel semantics, etc. >> >> However, some applications require a stronger commitment from the >> kernel to avoid interruptions, in particular userspace device >> driver style applications, such as high-speed networking code. >> >> This change introduces a framework to allow applications to elect >> to have the stronger semantics as needed, specifying >> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >> Subsequent commits will add additional flags and additional >> semantics. > I thought the general consensus was that this should be the default > behavior and that any associated bugs should be fixed. I think it comes down to dividing the set of use cases in two: - "Regular" nohz_full, as used to improve performance and limit interruptions, possibly for power benefits, etc. But, stray interrupts are not particularly bad, and you don't want to take extreme measures to avoid them. - What I'm calling "cpu_isolated" mode where when you return to userspace, you expect that by God, the kernel doesn't interrupt you again, and if it does, it's a flat-out bug. 
There are a few things that cpu_isolated mode currently does to accomplish its goals that are pretty heavy-weight: Processes are held in kernel space until ticks are quiesced; this is not necessarily what every nohz_full task wants. If a task makes a kernel call, there may well be arbitrary timer fallout, and having a way to select whether or not you are willing to take a timer tick after return to userspace is pretty important. Likewise, there are things that you may want to do on return to userspace that are designed to prevent further interruptions in cpu_isolated mode, even at a possible future performance cost if and when you return to the kernel, such as flushing the per-cpu free page list so that you won't be interrupted by an IPI to flush it later. If you're arguing that the cpu_isolated semantic is really the only one that makes sense for nohz_full, my sense is that it might be surprising to many of the folks who do nohz_full work. But, I'm happy to be wrong on this point, and maybe all the nohz_full community is interested in making the same tradeoffs for nohz_full generally that I've proposed in this patch series just for cpu_isolated? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 21:01 ` Chris Metcalf @ 2015-07-13 21:45 ` Andy Lutomirski 2015-07-21 19:10 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-07-13 21:45 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> The existing nohz_full mode makes tradeoffs to minimize userspace >>> interruptions while still attempting to avoid overheads in the >>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>> >>> However, some applications require a stronger commitment from the >>> kernel to avoid interruptions, in particular userspace device >>> driver style applications, such as high-speed networking code. >>> >>> This change introduces a framework to allow applications to elect >>> to have the stronger semantics as needed, specifying >>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>> Subsequent commits will add additional flags and additional >>> semantics. >> >> I thought the general consensus was that this should be the default >> behavior and that any associated bugs should be fixed. > > > I think it comes down to dividing the set of use cases in two: > > - "Regular" nohz_full, as used to improve performance and limit > interruptions, possibly for power benefits, etc. But, stray > interrupts are not particularly bad, and you don't want to take > extreme measures to avoid them. 
> > - What I'm calling "cpu_isolated" mode where when you return to > userspace, you expect that by God, the kernel doesn't interrupt you > again, and if it does, it's a flat-out bug. > > There are a few things that cpu_isolated mode currently does to > accomplish its goals that are pretty heavy-weight: > > Processes are held in kernel space until ticks are quiesced; this is > not necessarily what every nohz_full task wants. If a task makes a > kernel call, there may well be arbitrary timer fallout, and having a > way to select whether or not you are willing to take a timer tick after > return to userspace is pretty important. Then shouldn't deferred work be done immediately in nohz_full mode regardless? What is this delayed work that's being done? > > Likewise, there are things that you may want to do on return to > userspace that are designed to prevent further interruptions in > cpu_isolated mode, even at a possible future performance cost if and > when you return to the kernel, such as flushing the per-cpu free page > list so that you won't be interrupted by an IPI to flush it later. > Why not just kick the per-cpu free page over to whatever cpu is monitoring your RCU state, etc? That should be very quick. > If you're arguing that the cpu_isolated semantic is really the only > one that makes sense for nohz_full, my sense is that it might be > surprising to many of the folks who do nohz_full work. But, I'm happy > to be wrong on this point, and maybe all the nohz_full community is > interested in making the same tradeoffs for nohz_full generally that > I've proposed in this patch series just for cpu_isolated? nohz_full is currently dog slow for no particularly good reasons. I suspect that the interrupts you're seeing are also there for no particularly good reasons as well. Let's fix them instead of adding new ABIs to work around them. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 21:45 ` Andy Lutomirski @ 2015-07-21 19:10 ` Chris Metcalf 2015-07-21 19:26 ` Andy Lutomirski 2015-07-24 14:03 ` Frederic Weisbecker 0 siblings, 2 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-21 19:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel Sorry for the delay in responding; some other priorities came up internally. On 07/13/2015 05:45 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >>> wrote: >>>> The existing nohz_full mode makes tradeoffs to minimize userspace >>>> interruptions while still attempting to avoid overheads in the >>>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>>> >>>> However, some applications require a stronger commitment from the >>>> kernel to avoid interruptions, in particular userspace device >>>> driver style applications, such as high-speed networking code. >>>> >>>> This change introduces a framework to allow applications to elect >>>> to have the stronger semantics as needed, specifying >>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>>> Subsequent commits will add additional flags and additional >>>> semantics. >>> I thought the general consensus was that this should be the default >>> behavior and that any associated bugs should be fixed. >> >> I think it comes down to dividing the set of use cases in two: >> >> - "Regular" nohz_full, as used to improve performance and limit >> interruptions, possibly for power benefits, etc. 
But, stray >> interrupts are not particularly bad, and you don't want to take >> extreme measures to avoid them. >> >> - What I'm calling "cpu_isolated" mode where when you return to >> userspace, you expect that by God, the kernel doesn't interrupt you >> again, and if it does, it's a flat-out bug. >> >> There are a few things that cpu_isolated mode currently does to >> accomplish its goals that are pretty heavy-weight: >> >> Processes are held in kernel space until ticks are quiesced; this is >> not necessarily what every nohz_full task wants. If a task makes a >> kernel call, there may well be arbitrary timer fallout, and having a >> way to select whether or not you are willing to take a timer tick after >> return to userspace is pretty important. > Then shouldn't deferred work be done immediately in nohz_full mode > regardless? What is this delayed work that's being done? I'm thinking of things like needing to wait for an RCU quiesce period to complete. In the current version, there's also the vmstat_update() that may schedule delayed work and interrupt the core again shortly before realizing that there are no more counter updates happening, at which point it quiesces. Currently we handle this in cpu_isolated mode simply by spinning and waiting for the timer interrupts to complete. >> Likewise, there are things that you may want to do on return to >> userspace that are designed to prevent further interruptions in >> cpu_isolated mode, even at a possible future performance cost if and >> when you return to the kernel, such as flushing the per-cpu free page >> list so that you won't be interrupted by an IPI to flush it later. > Why not just kick the per-cpu free page over to whatever cpu is > monitoring your RCU state, etc? That should be very quick. So just for the sake of precision, the thing I'm talking about is the lru_add_drain() call on kernel exit. Are you proposing that we call that for every nohz_full core on kernel exit? 
I'm not opposed to this, but I don't know if other nohz developers feel like this is the right tradeoff. Similarly, addressing the vmstat_update() issue above, in cpu_isolated mode we might want to have a follow-on patch that forces the vmstat system into quiesced state on return to userspace. We would need to do this unconditionally on all nohz_full cores if we tried to combine the current nohz_full with my proposed cpu_isolated functionality. Again, I'm not necessarily opposed, but I suspect other nohz developers might not want this. (I didn't want to introduce such a patch as part of this series since it pulls in even more interested parties, and it gets harder and harder to get to consensus.) >> If you're arguing that the cpu_isolated semantic is really the only >> one that makes sense for nohz_full, my sense is that it might be >> surprising to many of the folks who do nohz_full work. But, I'm happy >> to be wrong on this point, and maybe all the nohz_full community is >> interested in making the same tradeoffs for nohz_full generally that >> I've proposed in this patch series just for cpu_isolated? > nohz_full is currently dog slow for no particularly good reasons. I > suspect that the interrupts you're seeing are also there for no > particularly good reasons as well. > > Let's fix them instead of adding new ABIs to work around them. Well, in principle if we accepted my proposed patch series and then over time came to decide that it was reasonable for nohz_full to have these complete cpu isolation semantics, the one proposed ABI simply becomes a no-op. So it's not as problematic an ABI as some. My issue is this: I'm totally happy with submitting a revised patch series that does all the stuff for pure nohz_full that I'm currently proposing for cpu_isolated. But, is it what the community wants? Should I propose it and see? Frederic, do you have any insight here? Thanks! 
-- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:10 ` Chris Metcalf @ 2015-07-21 19:26 ` Andy Lutomirski 2015-07-21 20:36 ` Paul E. McKenney 2015-07-24 20:22 ` Chris Metcalf 2015-07-24 14:03 ` Frederic Weisbecker 1 sibling, 2 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-07-21 19:26 UTC (permalink / raw) To: Chris Metcalf, Paul McKenney Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Sorry for the delay in responding; some other priorities came up internally. > > On 07/13/2015 05:45 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: >>>> >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >>>> >>>> wrote: >>>>> >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace >>>>> interruptions while still attempting to avoid overheads in the >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc. >>>>> >>>>> However, some applications require a stronger commitment from the >>>>> kernel to avoid interruptions, in particular userspace device >>>>> driver style applications, such as high-speed networking code. >>>>> >>>>> This change introduces a framework to allow applications to elect >>>>> to have the stronger semantics as needed, specifying >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >>>>> Subsequent commits will add additional flags and additional >>>>> semantics. >>>> >>>> I thought the general consensus was that this should be the default >>>> behavior and that any associated bugs should be fixed. 
>>> >>> >>> I think it comes down to dividing the set of use cases in two: >>> >>> - "Regular" nohz_full, as used to improve performance and limit >>> interruptions, possibly for power benefits, etc. But, stray >>> interrupts are not particularly bad, and you don't want to take >>> extreme measures to avoid them. >>> >>> - What I'm calling "cpu_isolated" mode where when you return to >>> userspace, you expect that by God, the kernel doesn't interrupt you >>> again, and if it does, it's a flat-out bug. >>> >>> There are a few things that cpu_isolated mode currently does to >>> accomplish its goals that are pretty heavy-weight: >>> >>> Processes are held in kernel space until ticks are quiesced; this is >>> not necessarily what every nohz_full task wants. If a task makes a >>> kernel call, there may well be arbitrary timer fallout, and having a >>> way to select whether or not you are willing to take a timer tick after >>> return to userspace is pretty important. >> >> Then shouldn't deferred work be done immediately in nohz_full mode >> regardless? What is this delayed work that's being done? > > > I'm thinking of things like needing to wait for an RCU quiesce > period to complete. rcu_nocbs does this, right? > > In the current version, there's also the vmstat_update() that > may schedule delayed work and interrupt the core again > shortly before realizing that there are no more counter updates > happening, at which point it quiesces. Currently we handle > this in cpu_isolated mode simply by spinning and waiting for > the timer interrupts to complete. Perhaps we should fix that? > >>> Likewise, there are things that you may want to do on return to >>> userspace that are designed to prevent further interruptions in >>> cpu_isolated mode, even at a possible future performance cost if and >>> when you return to the kernel, such as flushing the per-cpu free page >>> list so that you won't be interrupted by an IPI to flush it later. 
>> >> Why not just kick the per-cpu free page over to whatever cpu is >> monitoring your RCU state, etc? That should be very quick. > > > So just for the sake of precision, the thing I'm talking about > is the lru_add_drain() call on kernel exit. Are you proposing > that we call that for every nohz_full core on kernel exit? > I'm not opposed to this, but I don't know if other nohz > developers feel like this is the right tradeoff. I'm proposing either that we do that or that we arrange for other cpus to be able to steal our LRU list while we're in RCU user/idle. >> Let's fix them instead of adding new ABIs to work around them. > > > Well, in principle if we accepted my proposed patch series > and then over time came to decide that it was reasonable > for nohz_full to have these complete cpu isolation > semantics, the one proposed ABI simply becomes a no-op. > So it's not as problematic an ABI as some. What if we made it a debugfs thing instead of a prctl? Have a mode where the system tries really hard to quiesce itself even at the cost of performance. > > My issue is this: I'm totally happy with submitting a revised > patch series that does all the stuff for pure nohz_full that > I'm currently proposing for cpu_isolated. But, is it what > the community wants? Should I propose it and see? > > Frederic, do you have any insight here? Thanks! > > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com > > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode @ 2015-07-21 20:36 ` Paul E. McKenney 0 siblings, 0 replies; 340+ messages in thread From: Paul E. McKenney @ 2015-07-21 20:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On Tue, Jul 21, 2015 at 12:26:17PM -0700, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > > Sorry for the delay in responding; some other priorities came up internally. > > > > On 07/13/2015 05:45 PM, Andy Lutomirski wrote: > >> > >> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf <cmetcalf@ezchip.com> > >> wrote: > >>> > >>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote: > >>>> > >>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> > >>>> > >>>> wrote: > >>>>> > >>>>> The existing nohz_full mode makes tradeoffs to minimize userspace > >>>>> interruptions while still attempting to avoid overheads in the > >>>>> kernel entry/exit path, to provide 100% kernel semantics, etc. > >>>>> > >>>>> However, some applications require a stronger commitment from the > >>>>> kernel to avoid interruptions, in particular userspace device > >>>>> driver style applications, such as high-speed networking code. > >>>>> > >>>>> This change introduces a framework to allow applications to elect > >>>>> to have the stronger semantics as needed, specifying > >>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > >>>>> Subsequent commits will add additional flags and additional > >>>>> semantics. > >>>> > >>>> I thought the general consensus was that this should be the default > >>>> behavior and that any associated bugs should be fixed. 
> >>> > >>> > >>> I think it comes down to dividing the set of use cases in two: > >>> > >>> - "Regular" nohz_full, as used to improve performance and limit > >>> interruptions, possibly for power benefits, etc. But, stray > >>> interrupts are not particularly bad, and you don't want to take > >>> extreme measures to avoid them. > >>> > >>> - What I'm calling "cpu_isolated" mode where when you return to > >>> userspace, you expect that by God, the kernel doesn't interrupt you > >>> again, and if it does, it's a flat-out bug. > >>> > >>> There are a few things that cpu_isolated mode currently does to > >>> accomplish its goals that are pretty heavy-weight: > >>> > >>> Processes are held in kernel space until ticks are quiesced; this is > >>> not necessarily what every nohz_full task wants. If a task makes a > >>> kernel call, there may well be arbitrary timer fallout, and having a > >>> way to select whether or not you are willing to take a timer tick after > >>> return to userspace is pretty important. > >> > >> Then shouldn't deferred work be done immediately in nohz_full mode > >> regardless? What is this delayed work that's being done? > > > > I'm thinking of things like needing to wait for an RCU quiesce > > period to complete. > > rcu_nocbs does this, right? CONFIG_RCU_NOCB_CPUS offloads the RCU callbacks to a kthread, which allows the nohz CPU to turn off its scheduling-clock tick more frequently. Chris might have some other reason to wait for an RCU grace period, given that waiting for an RCU grace period would not guarantee no callbacks. Some more might have arrived in the meantime, and there can be some delay between the end of the grace period and the invocation of the callbacks. > > In the current version, there's also the vmstat_update() that > > may schedule delayed work and interrupt the core again > > shortly before realizing that there are no more counter updates > > happening, at which point it quiesces. 
Currently we handle > > this in cpu_isolated mode simply by spinning and waiting for > > the timer interrupts to complete. > > Perhaps we should fix that? Didn't Christoph Lameter fix this? Or is this an additional problem? Thanx, Paul > >>> Likewise, there are things that you may want to do on return to > >>> userspace that are designed to prevent further interruptions in > >>> cpu_isolated mode, even at a possible future performance cost if and > >>> when you return to the kernel, such as flushing the per-cpu free page > >>> list so that you won't be interrupted by an IPI to flush it later. > >> > >> Why not just kick the per-cpu free page over to whatever cpu is > >> monitoring your RCU state, etc? That should be very quick. > > > > > > So just for the sake of precision, the thing I'm talking about > > is the lru_add_drain() call on kernel exit. Are you proposing > > that we call that for every nohz_full core on kernel exit? > > I'm not opposed to this, but I don't know if other nohz > > developers feel like this is the right tradeoff. > > I'm proposing either that we do that or that we arrange for other cpus > to be able to steal our LRU list while we're in RCU user/idle. > > >> Let's fix them instead of adding new ABIs to work around them. > > > > > > Well, in principle if we accepted my proposed patch series > > and then over time came to decide that it was reasonable > > for nohz_full to have these complete cpu isolation > > semantics, the one proposed ABI simply becomes a no-op. > > So it's not as problematic an ABI as some. > > What if we made it a debugfs thing instead of a prctl? Have a mode > where the system tries really hard to quiesce itself even at the cost > of performance. > > > > > My issue is this: I'm totally happy with submitting a revised > > patch series that does all the stuff for pure nohz_full that > > I'm currently proposing for cpu_isolated. But, is it what > > the community wants? Should I propose it and see? 
> > > > Frederic, do you have any insight here? Thanks! > > > > -- > > Chris Metcalf, EZChip Semiconductor > > http://www.ezchip.com > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-api" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Andy Lutomirski > AMA Capital Management, LLC > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 20:36 ` Paul E. McKenney (?) @ 2015-07-22 13:57 ` Christoph Lameter 2015-07-22 19:28 ` Paul E. McKenney -1 siblings, 1 reply; 340+ messages in thread From: Christoph Lameter @ 2015-07-22 13:57 UTC (permalink / raw) To: Paul E. McKenney Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc, Linux API, linux-kernel On Tue, 21 Jul 2015, Paul E. McKenney wrote: > > > In the current version, there's also the vmstat_update() that > > > may schedule delayed work and interrupt the core again > > > shortly before realizing that there are no more counter updates > > > happening, at which point it quiesces. Currently we handle > > > this in cpu_isolated mode simply by spinning and waiting for > > > the timer interrupts to complete. > > > > Perhaps we should fix that? > > Didn't Christoph Lameter fix this? Or is this an additional problem? Well the vmstat update must realize first that there are no outstanding updates before switching itself off. So typically there is one extra tick. But we could add another function that will simply fold the differential immediately and turn the kworker task in the expectation that the processor will stay quiet. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode @ 2015-07-22 19:28 ` Paul E. McKenney 0 siblings, 0 replies; 340+ messages in thread From: Paul E. McKenney @ 2015-07-22 19:28 UTC (permalink / raw) To: Christoph Lameter Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc, Linux API, linux-kernel On Wed, Jul 22, 2015 at 08:57:45AM -0500, Christoph Lameter wrote: > On Tue, 21 Jul 2015, Paul E. McKenney wrote: > > > > > In the current version, there's also the vmstat_update() that > > > > may schedule delayed work and interrupt the core again > > > > shortly before realizing that there are no more counter updates > > > > happening, at which point it quiesces. Currently we handle > > > > this in cpu_isolated mode simply by spinning and waiting for > > > > the timer interrupts to complete. > > > > > > Perhaps we should fix that? > > > > Didn't Christoph Lameter fix this? Or is this an additional problem? > > Well the vmstat update must realize first that there are no outstanding > updates before switching itself off. So typically there is one extra tick. > But we could add another function that will simply fold the differential > immediately and turn the kworker task in the expectation that the > processor will stay quiet. Got it, thank you! Thanx, Paul ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-22 19:28 ` Paul E. McKenney (?) @ 2015-07-22 20:02 ` Christoph Lameter 2015-07-24 20:21 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Christoph Lameter @ 2015-07-22 20:02 UTC (permalink / raw) To: Paul E. McKenney Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc, Linux API, linux-kernel

On Wed, 22 Jul 2015, Paul E. McKenney wrote:

> > > Didn't Christoph Lameter fix this? Or is this an additional problem?
> >
> > Well the vmstat update must realize first that there are no outstanding
> > updates before switching itself off. So typically there is one extra tick.
> > But we could add another function that will simply fold the differential
> > immediately and turn the kworker task in the expectation that the
> > processor will stay quiet.
>
> Got it, thank you!
>
> 							Thanx, Paul

Ok here is a function that quiets down the vmstat kworkers.

Subject: vmstat: provide a function to quiet down the diff processing

quiet_vmstat() can be called in anticipation of a OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing.

Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_st
 }
 
 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
			struct per_cpu_pageset *pset) { }

^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-22 20:02 ` Christoph Lameter @ 2015-07-24 20:21 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw) To: Christoph Lameter, Paul E. McKenney Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Viresh Kumar, linux-doc, Linux API, linux-kernel On 07/22/2015 04:02 PM, Christoph Lameter wrote: > On Wed, 22 Jul 2015, Paul E. McKenney wrote: > >>>> Didn't Christoph Lameter fix this? Or is this an additional problem? >>> Well the vmstat update must realize first that there are no outstanding >>> updates before switching itself off. So typically there is one extra tick. >>> But we could add another function that will simply fold the differential >>> immediately and turn the kworker task in the expectation that the >>> processor will stay quiet. >> Got it, thank you! >> >> Thanx, Paul > Ok here is a function that quiets down the vmstat kworkers. That's great - I will include this patch in my series then, and call it as part of the "hard isolation" mode return to userspace. Thanks! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:26 ` Andy Lutomirski 2015-07-21 20:36 ` Paul E. McKenney @ 2015-07-24 20:22 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-24 20:22 UTC (permalink / raw) To: Andy Lutomirski, Paul McKenney Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel On 07/21/2015 03:26 PM, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:10 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> So just for the sake of precision, the thing I'm talking about >> is the lru_add_drain() call on kernel exit. Are you proposing >> that we call that for every nohz_full core on kernel exit? >> I'm not opposed to this, but I don't know if other nohz >> developers feel like this is the right tradeoff. > I'm proposing either that we do that or that we arrange for other cpus > to be able to steal our LRU list while we're in RCU user/idle. That seems challenging; there is a lot that has to be done in lru_add_drain() and we may not want to do it for the "soft isolation" mode Frederic alludes to in a later email. And, we would have to add a bunch of locking to allow another process to steal the list from under us, so that's not obviously going to be a performance win in terms of the per-cpu page cache for normal operations. Perhaps there could be a lock taken that nohz_full processes have to take just to exit from userspace, and that other tasks could take to do things on behalf of the nohz_full process that it thinks it can do locklessly. It gets complicated, since you'd want to tie that to whether the nohz_full process was currently in the kernel or not, so some kind of atomic update on the context_tracking state or some such, perhaps. 
Still not really clear if that overhead is worth it (both from a maintenance point of view and the possible performance hit). Limiting it just to the hard isolation mode seems like a good answer since there we really know that userspace does not care about the performance implications of kernel/userspace transitions, and it doesn't cause slowdowns to anyone else. For now I will bundle it in with my respin as part of the "hard isolation" mode Frederic proposed. >> Well, in principle if we accepted my proposed patch series >> and then over time came to decide that it was reasonable >> for nohz_full to have these complete cpu isolation >> semantics, the one proposed ABI simply becomes a no-op. >> So it's not as problematic an ABI as some. > What if we made it a debugfs thing instead of a prctl? Have a mode > where the system tries really hard to quiesce itself even at the cost > of performance. No, since it's really a mode within an individual task that you'd like to switch on and off depending on what the task is trying to do - strict mode while it's running its main fast-path userspace code, but certainly not strict mode during its setup, and possibly leaving strict mode to run some kinds of slow-path, diagnostic, or error-handling code. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-21 19:10 ` Chris Metcalf 2015-07-21 19:26 ` Andy Lutomirski @ 2015-07-24 14:03 ` Frederic Weisbecker 2015-07-24 20:19 ` Chris Metcalf 1 sibling, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-07-24 14:03 UTC (permalink / raw) To: Chris Metcalf Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel, Mike Galbraith On Tue, Jul 21, 2015 at 03:10:54PM -0400, Chris Metcalf wrote: > >>If you're arguing that the cpu_isolated semantic is really the only > >>one that makes sense for nohz_full, my sense is that it might be > >>surprising to many of the folks who do nohz_full work. But, I'm happy > >>to be wrong on this point, and maybe all the nohz_full community is > >>interested in making the same tradeoffs for nohz_full generally that > >>I've proposed in this patch series just for cpu_isolated? > >nohz_full is currently dog slow for no particularly good reasons. I > >suspect that the interrupts you're seeing are also there for no > >particularly good reasons as well. > > > >Let's fix them instead of adding new ABIs to work around them. > > Well, in principle if we accepted my proposed patch series > and then over time came to decide that it was reasonable > for nohz_full to have these complete cpu isolation > semantics, the one proposed ABI simply becomes a no-op. > So it's not as problematic an ABI as some. > > My issue is this: I'm totally happy with submitting a revised > patch series that does all the stuff for pure nohz_full that > I'm currently proposing for cpu_isolated. But, is it what > the community wants? Should I propose it and see? > > Frederic, do you have any insight here? Thanks! 
So you guys mean that if nohz_full were implemented fully, the way we expect it to be, we wouldn't be burdened by noise at all, and this whole patchset would therefore be pointless, right? And that would meet the requirements both of those who want hard isolation (a critical noise-free guarantee) and of those who want soft isolation (as little noise as possible, for performance).

Well, first of all, nohz is not isolation; it's a significant part of it, but it's not all of isolation. We really want to separate these things and not mess up isolation policies in the tick code.

Second, yes, perhaps we can eventually have both the soft and hard isolation expectations implemented the same way, through hard isolation. But that will only work if we don't do that polling for noise-freedom before resuming userspace, which might be acceptable for hard isolation that is ready to sacrifice some warm-up before a run to meet its guarantees, but won't work for soft isolation workloads. So the only solution is to offload everything we can to housekeeping CPUs. And if we still have stuff that can't be dealt with that way and needs to be taken care of with some explicit operation before resuming userspace, then we can start to think about splitting things into several isolation configs. Similarly, offloading everything to housekeepers means we sacrifice a CPU that could have been used for performance-oriented workloads, so that might not suit soft isolation either. But I think we'll see all that once we manage to have pure noise-free CPUs (some patches are on the way to be posted by Vatika Harlalka concerning killing the residual 1Hz tick).

To summarize: let's first split nohz and isolation. Introduce CONFIG_CPU_ISOLATION and move all the isolation policies into kernel/cpu_isolation.c; let's try to implement hard isolation and see if that meets the needs of soft isolation users as well, and if not we'll split that later.
And we can keep the prctl to tell the user when hard isolation has been broken, through SIGKILL or whatever. I think we do a similar thing with SCHED_DEADLINE when a task hasn't met its deadline requirement. We might want to do the same. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-24 14:03 ` Frederic Weisbecker @ 2015-07-24 20:19 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-24 20:19 UTC (permalink / raw) To: Frederic Weisbecker Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, Linux API, linux-kernel, Mike Galbraith On 07/24/2015 10:03 AM, Frederic Weisbecker wrote: > To summarize, lets first split nohz and isolation. Introduce > CONFIG_CPU_ISOLATION and stuff all the isolation policies to > kernel/cpu_isolation.c, lets try to implement hard isolation and see if that > meets soft isolation workload users as well, if not we'll split that later. I will do that for v5. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-13 19:57 ` Chris Metcalf (?) (?) @ 2015-07-24 13:27 ` Frederic Weisbecker 2015-07-24 20:21 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-07-24 13:27 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote: > The existing nohz_full mode makes tradeoffs to minimize userspace > interruptions while still attempting to avoid overheads in the > kernel entry/exit path, to provide 100% kernel semantics, etc. > > However, some applications require a stronger commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications to elect > to have the stronger semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. > > The "cpu_isolated" state is indicated by setting a new task struct > field, cpu_isolated_flags, to the value passed by prctl(). When the > _ENABLE bit is set for a task, and it is returning to userspace > on a nohz_full core, it calls the new tick_nohz_cpu_isolated_enter() > routine to take additional actions to help the task avoid being > interrupted in the future. > > Initially, there are only two actions taken. First, the task > calls lru_add_drain() to prevent being interrupted by a subsequent > lru_add_drain_all() call on another core. Then, the code checks for > pending timer interrupts and quiesces until they are no longer pending. > As a result, sys calls (and page faults, etc.) can be inordinately slow. 
> However, this quiescing guarantees that no unexpected interrupts will > occur, even if the application intentionally calls into the kernel. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/tile/kernel/process.c | 9 ++++++++ > include/linux/sched.h | 3 +++ > include/linux/tick.h | 10 ++++++++ > include/uapi/linux/prctl.h | 5 ++++ > kernel/context_tracking.c | 3 +++ > kernel/sys.c | 8 +++++++ > kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++ > 7 files changed, 95 insertions(+) > > diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c > index e036c0aa9792..3625e839ad62 100644 > --- a/arch/tile/kernel/process.c > +++ b/arch/tile/kernel/process.c > @@ -70,6 +70,15 @@ void arch_cpu_idle(void) > _cpu_idle(); > } > > +#ifdef CONFIG_NO_HZ_FULL I think this goes way beyond nohz itself. We don't only want the tick to shut down; we also want the pending timers, workqueues, etc... It's time to create the CONFIG_ISOLATION_foo stuff. > +void tick_nohz_cpu_isolated_wait(void) > +{ > + set_current_state(TASK_INTERRUPTIBLE); > + _cpu_idle(); > + set_current_state(TASK_RUNNING); > +} > +#endif > + > /* > * Release a thread_info structure > */ > diff --git a/include/linux/sched.h b/include/linux/sched.h > index ae21f1591615..f350b0c20bbc 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1778,6 +1778,9 @@ struct task_struct { > unsigned long task_state_change; > #endif > int pagefault_disabled; > +#ifdef CONFIG_NO_HZ_FULL > + unsigned int cpu_isolated_flags; > +#endif > }; > > /* Future-safe accessor for struct task_struct's cpus_allowed. 
*/ > diff --git a/include/linux/tick.h b/include/linux/tick.h > index 3741ba1a652c..cb5569181359 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -10,6 +10,7 @@ > #include <linux/context_tracking_state.h> > #include <linux/cpumask.h> > #include <linux/sched.h> > +#include <linux/prctl.h> > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > extern void __init tick_init(void); > @@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) > cpumask_or(mask, mask, tick_nohz_full_mask); > } > > +static inline bool tick_nohz_is_cpu_isolated(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); > +} > + > extern void __tick_nohz_full_check(void); > extern void tick_nohz_full_kick(void); > extern void tick_nohz_full_kick_cpu(int cpu); > extern void tick_nohz_full_kick_all(void); > extern void __tick_nohz_task_switch(struct task_struct *tsk); > +extern void tick_nohz_cpu_isolated_enter(void); > #else > static inline bool tick_nohz_full_enabled(void) { return false; } > static inline bool tick_nohz_full_cpu(int cpu) { return false; } > @@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { } > static inline void tick_nohz_full_kick(void) { } > static inline void tick_nohz_full_kick_all(void) { } > static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } > +static inline bool tick_nohz_is_cpu_isolated(void) { return false; } > +static inline void tick_nohz_cpu_isolated_enter(void) { } > #endif > > static inline bool is_housekeeping_cpu(int cpu) > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 31891d9535e2..edb40b6b84db 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -190,4 +190,9 @@ struct prctl_mm_map { > # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ > # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ > > +/* Enable/disable or query cpu_isolated mode for 
NO_HZ_FULL kernels. */ > +#define PR_SET_CPU_ISOLATED 47 > +#define PR_GET_CPU_ISOLATED 48 > +# define PR_CPU_ISOLATED_ENABLE (1 << 0) > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c > index 0a495ab35bc7..f9de3ee12723 100644 > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -20,6 +20,7 @@ > #include <linux/hardirq.h> > #include <linux/export.h> > #include <linux/kprobes.h> > +#include <linux/tick.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/context_tracking.h> > @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) > * on the tick. > */ > if (state == CONTEXT_USER) { > + if (tick_nohz_is_cpu_isolated()) > + tick_nohz_cpu_isolated_enter(); > trace_user_enter(0); > vtime_user_enter(current); > } > diff --git a/kernel/sys.c b/kernel/sys.c > index 259fda25eb6b..36eb9a839f1f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > case PR_GET_FP_MODE: > error = GET_FP_MODE(me); > break; > +#ifdef CONFIG_NO_HZ_FULL > + case PR_SET_CPU_ISOLATED: > + me->cpu_isolated_flags = arg2; > + break; > + case PR_GET_CPU_ISOLATED: > + error = me->cpu_isolated_flags; > + break; > +#endif > default: > error = -EINVAL; > break; > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index c792429e98c6..4cf093c012d1 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -24,6 +24,7 @@ > #include <linux/posix-timers.h> > #include <linux/perf_event.h> > #include <linux/context_tracking.h> > +#include <linux/swap.h> > > #include <asm/irq_regs.h> > > @@ -389,6 +390,62 @@ void __init tick_nohz_init(void) > pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n", > cpumask_pr_args(tick_nohz_full_mask)); > } > + > +/* > + * Rather than continuously polling for the next_event in the > + * tick_cpu_device, architectures can provide a method to save power > + * by 
sleeping until an interrupt arrives. > + */ > +void __weak tick_nohz_cpu_isolated_wait(void) > +{ > + cpu_relax(); > +} > + > +/* > + * We normally return immediately to userspace. > + * > + * In "cpu_isolated" mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two "cpu_isolated" processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void tick_nohz_cpu_isolated_enter(void) Similarly, I'd rather see that in kernel/cpu_isolation.c and call it cpu_isolation_enter(). > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + warned = true; > + } > + if (should_resched()) > + schedule(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; > + tick_nohz_cpu_isolated_wait(); If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen? We could either implement some sort of tick waiters with a proper wake-up once the CPU sees no tick to schedule. Arguably this is all risky because it involves a scheduler wake-up and thus the risk of new noise. But it might work. Another possibility is an msleep() based wait. 
But that's about the same, maybe even worse due to repetitive wake ups. > + } > + if (warned) { > + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + dump_stack(); > + } > +} > + > #endif > > /* > -- > 2.1.2 > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode 2015-07-24 13:27 ` Frederic Weisbecker @ 2015-07-24 20:21 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-24 20:21 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel On 07/24/2015 09:27 AM, Frederic Weisbecker wrote: > On Mon, Jul 13, 2015 at 03:57:57PM -0400, Chris Metcalf wrote: >> +{ >> + struct clock_event_device *dev = >> + __this_cpu_read(tick_cpu_device.evtdev); >> + struct task_struct *task = current; >> + unsigned long start = jiffies; >> + bool warned = false; >> + >> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ >> + lru_add_drain(); >> + >> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start) / HZ); >> + warned = true; >> + } >> + if (should_resched()) >> + schedule(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; >> + tick_nohz_cpu_isolated_wait(); > If we call cpu_idle(), what is going to wake the CPU up if no further interrupts happen? > > We could either implement some sort of tick waiters with a proper wake-up once the CPU sees > no tick to schedule. Arguably this is all risky because it involves a scheduler wake-up > and thus the risk of new noise. But it might work. > > Another possibility is an msleep() based wait. But that's about the same, maybe even worse > due to repetitive wake-ups. The presumption here is that it is not possible to have tick_cpu_device have a pending next_event without also having a timer interrupt pending to go off. 
That certainly seems to be true on the architectures I have looked at. Do we think that might ever not be the case? We are running here with interrupts disabled, so this core won't transition from "timer interrupt scheduled" to "no timer interrupt scheduled" before we spin or idle, and presumably no other core can reach across and turn off our timer interrupt either. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-07-13 19:57 ` Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we add an internal bit to current->cpu_isolated_flags that is set when prctl() sets the flags. We check the bit on syscall entry as well as on any exception_enter(). The prctl() syscall is ignored to allow clearing the bit again later, and exit/exit_group are ignored to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 4 ++++ arch/tile/kernel/ptrace.c | 6 +++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/tick.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/tick-sched.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..7315b1579cbd 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report cpu_isolated violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..d4e43a13bab1 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,12 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall( + regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..860f346977e2 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..8b994e2a0330 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/tick.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (tick_nohz_cpu_isolated_strict()) + tick_nohz_cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/tick.h b/include/linux/tick.h index cb5569181359..f79f6945f762 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -157,6 +157,8 @@ extern void tick_nohz_full_kick_cpu(int cpu); extern void 
tick_nohz_full_kick_all(void); extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); +extern void tick_nohz_cpu_isolated_syscall(int nr); +extern void tick_nohz_cpu_isolated_exception(void); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -168,6 +170,8 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } +static inline void tick_nohz_cpu_isolated_syscall(int nr) { } +static inline void tick_nohz_cpu_isolated_exception(void) { } #endif static inline bool is_housekeeping_cpu(int cpu) @@ -200,4 +204,16 @@ static inline void tick_nohz_task_switch(struct task_struct *tsk) __tick_nohz_task_switch(tsk); } +static inline bool tick_nohz_cpu_isolated_strict(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index f9de3ee12723..fd051ea290ee 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. 
This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 4cf093c012d1..9f495c7c7dc2 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -27,6 +27,7 @@ #include <linux/swap.h> #include <asm/irq_regs.h> +#include <asm/unistd.h> #include "tick-internal.h" @@ -446,6 +447,43 @@ void tick_nohz_cpu_isolated_enter(void) } } +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void tick_nohz_cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void tick_nohz_cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} + #endif /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode @ 2015-07-13 21:47 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-07-13 21:47 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > With cpu_isolated mode, the task is in principle guaranteed not to be > interrupted by the kernel, but only if it behaves. In particular, if it > enters the kernel via system call, page fault, or any of a number of other > synchronous traps, it may be unexpectedly exposed to long latencies. > Add a simple flag that puts the process into a state where any such > kernel entry is fatal. > To me, this seems like the wrong design. If nothing else, it seems too much like an abusable anti-debugging mechanism. I can imagine some per-task flag "I think I shouldn't be interrupted now" and a tracepoint that fires if the task is interrupted with that flag set. But the strong cpu isolation stuff requires systemwide configuration, and I think that monitoring that it works should work similarly. More comments below. 
> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
> ---
>  arch/arm64/kernel/ptrace.c       |  4 ++++
>  arch/tile/kernel/ptrace.c        |  6 +++++-
>  arch/x86/kernel/ptrace.c         |  2 ++
>  include/linux/context_tracking.h | 11 ++++++++---
>  include/linux/tick.h             | 16 ++++++++++++++++
>  include/uapi/linux/prctl.h       |  1 +
>  kernel/context_tracking.c        |  9 ++++++---
>  kernel/time/tick-sched.c         | 38 ++++++++++++++++++++++++++++++++++++++
>  8 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index d882b833dbdb..7315b1579cbd 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>
>  asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>  {
> +	/* Ensure we report cpu_isolated violations in all circumstances. */
> +	if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
> +		tick_nohz_cpu_isolated_syscall(regs->syscallno);

IMO this is pointless.  If a user wants a syscall to kill them, use
seccomp.  The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.

> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>  		return 0;
>
>  	prev_ctx = this_cpu_read(context_tracking.state);
> -	if (prev_ctx != CONTEXT_KERNEL)
> -		context_tracking_exit(prev_ctx);
> +	if (prev_ctx != CONTEXT_KERNEL) {
> +		if (context_tracking_exit(prev_ctx)) {
> +			if (tick_nohz_cpu_isolated_strict())
> +				tick_nohz_cpu_isolated_exception();
> +		}
> +	}

NACK.  I'm cautiously optimistic that an x86 kernel 4.3 or newer will
simply never call exception_enter.  It certainly won't call it
frequently unless something goes wrong with the patches that are
already in -tip.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>  * This call supports re-entrancy. This way it can be called from any exception
>  * handler without needing to know if we came from userspace or not.
>  */
> -void context_tracking_exit(enum ctx_state state)
> +bool context_tracking_exit(enum ctx_state state)
>  {
>  	unsigned long flags;
> +	bool from_user = false;
>

IMO the internal context tracking API (e.g. context_tracking_exit) are
mostly of the form "hey context tracking: I don't really know what
you're doing or what I'm doing, but let me call you and make both of
us feel better."  You're making it somewhat worse: now it's all of the
above plus "I don't even know whether I just entered the kernel --
maybe you have a better idea".

Starting with 4.3, x86 kernels will know *exactly* when they enter the
kernel.  All of this context tracking what-was-my-previous-state stuff
will remain until someone kills it, but when it goes away we'll get a
nice performance boost.

So, no, let's implement this for real if we're going to implement it.

--Andy

^ permalink raw reply	[flat|nested] 340+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode @ 2015-07-21 19:34 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-21 19:34 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 07/13/2015 05:47 PM, Andy Lutomirski wrote: > On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> With cpu_isolated mode, the task is in principle guaranteed not to be >> interrupted by the kernel, but only if it behaves. In particular, if it >> enters the kernel via system call, page fault, or any of a number of other >> synchronous traps, it may be unexpectedly exposed to long latencies. >> Add a simple flag that puts the process into a state where any such >> kernel entry is fatal. >> > To me, this seems like the wrong design. If nothing else, it seems > too much like an abusable anti-debugging mechanism. I can imagine > some per-task flag "I think I shouldn't be interrupted now" and a > tracepoint that fires if the task is interrupted with that flag set. > But the strong cpu isolation stuff requires systemwide configuration, > and I think that monitoring that it works should work similarly. First, you mention a per-task flag, but not specifically whether the proposed prctl() mechanism is a reasonable way to set that flag. Just wanted to clarify that this wasn't an issue in and of itself for you. Second, you suggest a tracepoint. I'm OK with creating a tracepoint dedicated to cpu_isolated strict failures and making that the only way this mechanism works. But, earlier community feedback seemed to suggest that the signal mechanism was OK; one piece of feedback just requested being able to set which signal was delivered. 
Do you think the signal idea is a bad one?  Are you proposing
potentially having a signal and/or a tracepoint?

Last, you mention systemwide configuration for monitoring.  Can you
expand on what you mean by that?  We already support the monitoring
only on the nohz_full cores, so to that extent it's already systemwide.
And the per-task flag has to be set by the running process when it's
ready for this state, so that can't really be systemwide configuration.
I don't understand your suggestion on this point.

>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index d882b833dbdb..7315b1579cbd 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -1150,6 +1150,10 @@ static void tracehook_report_syscall(struct pt_regs *regs,
>>
>>  asmlinkage int syscall_trace_enter(struct pt_regs *regs)
>>  {
>> +	/* Ensure we report cpu_isolated violations in all circumstances. */
>> +	if (test_thread_flag(TIF_NOHZ) && tick_nohz_cpu_isolated_strict())
>> +		tick_nohz_cpu_isolated_syscall(regs->syscallno);
> IMO this is pointless.  If a user wants a syscall to kill them, use
> seccomp.  The kernel isn't at fault if the user does a syscall when it
> didn't want to enter the kernel.

Interesting!  I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here.  One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't call
seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.

So, do you think there's a good role for the seccomp() API to play
in achieving this goal?  It's certainly not a question of "the kernel
at fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience).  You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.

Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the
seccomp internals to make it happen.  It would require Kconfig to
ensure that SECCOMP was enabled though, which obviously isn't
currently required to do cpu isolation.

>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void)
>>  		return 0;
>>
>>  	prev_ctx = this_cpu_read(context_tracking.state);
>> -	if (prev_ctx != CONTEXT_KERNEL)
>> -		context_tracking_exit(prev_ctx);
>> +	if (prev_ctx != CONTEXT_KERNEL) {
>> +		if (context_tracking_exit(prev_ctx)) {
>> +			if (tick_nohz_cpu_isolated_strict())
>> +				tick_nohz_cpu_isolated_exception();
>> +		}
>> +	}
> NACK.  I'm cautiously optimistic that an x86 kernel 4.3 or newer will
> simply never call exception_enter.  It certainly won't call it
> frequently unless something goes wrong with the patches that are
> already in -tip.

This is intended to catch user exceptions like page faults, GPV or
(on platforms where this would happen) unaligned data traps.
The kernel still has a role to play here and cpu_isolated mode
needs to let the user know they have accidentally entered the
kernel in this case.

>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
>>  * This call supports re-entrancy. This way it can be called from any exception
>>  * handler without needing to know if we came from userspace or not.
>>  */
>> -void context_tracking_exit(enum ctx_state state)
>> +bool context_tracking_exit(enum ctx_state state)
>>  {
>>  	unsigned long flags;
>> +	bool from_user = false;
>>
> IMO the internal context tracking API (e.g. context_tracking_exit) are
> mostly of the form "hey context tracking: I don't really know what
> you're doing or what I'm doing, but let me call you and make both of
> us feel better."  You're making it somewhat worse: now it's all of the
> above plus "I don't even know whether I just entered the kernel --
> maybe you have a better idea".
>
> Starting with 4.3, x86 kernels will know *exactly* when they enter the
> kernel.  All of this context tracking what-was-my-previous-state stuff
> will remain until someone kills it, but when it goes away we'll get a
> nice performance boost.
>
> So, no, let's implement this for real if we're going to implement it.

I'm certainly OK with rebasing on top of 4.3 after the context
tracking stuff is better.  That said, I think it makes sense to continue
to debate the intent of the patch series even if we pull this one
patch out and defer it until after 4.3, or having it end up pulled
into some other repo that includes the improvements and
is being pulled for 4.3.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

^ permalink raw reply	[flat|nested] 340+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode @ 2015-07-21 19:42 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-07-21 19:42 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >> >> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> With cpu_isolated mode, the task is in principle guaranteed not to be >>> interrupted by the kernel, but only if it behaves. In particular, if it >>> enters the kernel via system call, page fault, or any of a number of >>> other >>> synchronous traps, it may be unexpectedly exposed to long latencies. >>> Add a simple flag that puts the process into a state where any such >>> kernel entry is fatal. >>> >> To me, this seems like the wrong design. If nothing else, it seems >> too much like an abusable anti-debugging mechanism. I can imagine >> some per-task flag "I think I shouldn't be interrupted now" and a >> tracepoint that fires if the task is interrupted with that flag set. >> But the strong cpu isolation stuff requires systemwide configuration, >> and I think that monitoring that it works should work similarly. > > > First, you mention a per-task flag, but not specifically whether the > proposed prctl() mechanism is a reasonable way to set that flag. > Just wanted to clarify that this wasn't an issue in and of itself for you. I think I'm okay with a per-task flag for this and, if you add one, then prctl() is presumably the way to go. Unless people think that nohz should be 100% reliable always, in which case might as well make the flag per-cpu. 
> > Second, you suggest a tracepoint. I'm OK with creating a tracepoint > dedicated to cpu_isolated strict failures and making that the only > way this mechanism works. But, earlier community feedback seemed to > suggest that the signal mechanism was OK; one piece of feedback > just requested being able to set which signal was delivered. Do you > think the signal idea is a bad one? Are you proposing potentially > having a signal and/or a tracepoint? I prefer the tracepoint. It's friendlier to debuggers, and it's really about diagnosing a kernel problem, not a userspace problem. Also, I really doubt that people should deploy a signal thing in production. What if an NMI fires and kills their realtime program? > > Last, you mention systemwide configuration for monitoring. Can you > expand on what you mean by that? We already support the monitoring > only on the nohz_full cores, so to that extent it's already systemwide. > And the per-task flag has to be set by the running process when it's > ready for this state, so that can't really be systemwide configuration. > I don't understand your suggestion on this point. I'm really thinking about systemwide configuration for isolation. I think we'll always (at least in the nearish term) need the admin's help to set up isolated CPUs. If the admin makes a whole CPU be isolated, then monitoring just that CPU and monitoring it all the time seems sensible. If we really do think that isolating a CPU should require a syscall of some sort because it's too expensive otherwise, then we can do it that way, too. And if full isolation requires some user help (e.g. don't do certain things that break isolation), then having a per-task monitoring flag seems reasonable. We may always need the user's help to avoid IPIs. For example, if one thread calls munmap, the other thread is going to get an IPI. There's nothing we can do about that. > I'm certainly OK with rebasing on top of 4.3 after the context > tracking stuff is better. 
That said, I think it makes sense to continue > to debate the intent of the patch series even if we pull this one > patch out and defer it until after 4.3, or having it end up pulled > into some other repo that includes the improvements and > is being pulled for 4.3. Sure, no problem. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v4 2/5] nohz: support PR_CPU_ISOLATED_STRICT mode 2015-07-21 19:42 ` Andy Lutomirski (?) @ 2015-07-24 20:29 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-24 20:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 07/21/2015 03:42 PM, Andy Lutomirski wrote: > On Tue, Jul 21, 2015 at 12:34 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Second, you suggest a tracepoint. I'm OK with creating a tracepoint >> dedicated to cpu_isolated strict failures and making that the only >> way this mechanism works. But, earlier community feedback seemed to >> suggest that the signal mechanism was OK; one piece of feedback >> just requested being able to set which signal was delivered. Do you >> think the signal idea is a bad one? Are you proposing potentially >> having a signal and/or a tracepoint? > I prefer the tracepoint. It's friendlier to debuggers, and it's > really about diagnosing a kernel problem, not a userspace problem. > Also, I really doubt that people should deploy a signal thing in > production. What if an NMI fires and kills their realtime program? No, this piece of the patch series is about diagnosing bugs in the userspace program (likely in third-party code, in our customers' experience). When you violate strict mode, you get a signal and you have a nice pointer to what instruction it was that caused you to enter the kernel. You are right that running this in production is likely not a great idea, as is true for other debugging mechanisms. 
But you might really want to have it as a signal with a signal handler that fires to generate a trace of some kind into the application's existing tracing mechanisms, so the app doesn't just report "wow, I lost a bunch of time in here somewhere, sorry about those packets I dropped on the floor", but "here's where I took a strict signal". You probably drop a few additional packets due to the signal handling and logging, but given you've already fallen away from 100% in this case, the extra diagnostics are almost certainly worth it. In this case it's probably not as helpful to have a tracepoint-based solution, just because you really do want to be able to easily integrate into the app's existing logging framework. My sense, I think, is that we can easily add tracepoints to the strict failure code in the future, so it may not be worth trying to widen the scope of the patch series just now. >> Last, you mention systemwide configuration for monitoring. Can you >> expand on what you mean by that? We already support the monitoring >> only on the nohz_full cores, so to that extent it's already systemwide. >> And the per-task flag has to be set by the running process when it's >> ready for this state, so that can't really be systemwide configuration. >> I don't understand your suggestion on this point. > I'm really thinking about systemwide configuration for isolation. I > think we'll always (at least in the nearish term) need the admin's > help to set up isolated CPUs. If the admin makes a whole CPU be > isolated, then monitoring just that CPU and monitoring it all the time > seems sensible. If we really do think that isolating a CPU should > require a syscall of some sort because it's too expensive otherwise, > then we can do it that way, too. And if full isolation requires some > user help (e.g. don't do certain things that break isolation), then > having a per-task monitoring flag seems reasonable. > > We may always need the user's help to avoid IPIs. 
For example, if one > thread calls munmap, the other thread is going to get an IPI. There's > nothing we can do about that. I think we're mostly agreed on this stuff, though your use of "monitored" doesn't really match the "strict" mode in this patch. It's certainly true that, for example, we advise customers not to run the slow-path code on a housekeeping cpu as a thread in the same process space as the fast-path code on the nohz_full cores, just because things like fclose() on a file descriptor will lead to free() which can lead to munmap() and an IPI to the fast path. >> I'm certainly OK with rebasing on top of 4.3 after the context >> tracking stuff is better. That said, I think it makes sense to continue >> to debate the intent of the patch series even if we pull this one >> patch out and defer it until after 4.3, or having it end up pulled >> into some other repo that includes the improvements and >> is being pulled for 4.3. > Sure, no problem. I will add a comment to the patch and a note to the series about this, but for now I'll keep it in the series. If we can arrange to pull it into Frederic's tree after the context_tracking changes, we can respin it at that point to layer it on top. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v4 3/5] nohz: cpu_isolated strict mode configurable signal 2015-07-13 19:57 ` Chris Metcalf @ 2015-07-13 19:57 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:57 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/tick-sched.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9f495c7c7dc2..c5eca9c99fad 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -447,11 +447,18 @@ void tick_nohz_cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) +static void kill_cpu_isolated_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = 
PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -470,7 +477,7 @@ void tick_nohz_cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -481,7 +488,7 @@ void tick_nohz_cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v4 4/5] nohz: add cpu_isolated_debug boot flag 2015-07-13 19:57 ` Chris Metcalf ` (3 preceding siblings ...) (?) @ 2015-07-13 19:58 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-doc, linux-kernel Cc: Chris Metcalf This flag simplifies debugging of NO_HZ_FULL kernels when processes are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts from the kernel, and if they do, when this boot flag is specified a kernel stack dump on the console is generated. It's possible to use ftrace to simply detect whether a cpu_isolated core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a cpu_isolated core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 6 ++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/tick.h | 2 ++ kernel/irq_work.c | 4 +++- kernel/sched/core.c | 18 ++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 6 ++++++ 8 files changed, 48 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1d6f0459cd7b..76e8e2ff4a0a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -749,6 +749,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. /proc/<pid>/coredump_filter. See also Documentation/filesystems/proc.txt. 
+ cpu_isolated_debug [KNL] + In kernels built with CONFIG_NO_HZ_FULL and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_CPU_ISOLATED_ENABLE. + cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..f336880e1b01 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/tick.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + tick_nohz_cpu_isolated_debug(cpu); + } } /* diff --git a/include/linux/tick.h b/include/linux/tick.h index f79f6945f762..ed65551e2315 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -159,6 +159,7 @@ extern void __tick_nohz_task_switch(struct task_struct *tsk); extern void tick_nohz_cpu_isolated_enter(void); extern void tick_nohz_cpu_isolated_syscall(int nr); extern void tick_nohz_cpu_isolated_exception(void); +extern void tick_nohz_cpu_isolated_debug(int cpu); #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } @@ -172,6 +173,7 @@ static inline bool tick_nohz_is_cpu_isolated(void) { return false; } static inline void tick_nohz_cpu_isolated_enter(void) { } static inline void tick_nohz_cpu_isolated_syscall(int nr) { } static inline void tick_nohz_cpu_isolated_exception(void) { } +static inline void tick_nohz_cpu_isolated_debug(int cpu) { } #endif static inline bool is_housekeeping_cpu(int cpu) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..7f35c90346de 100644 
--- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -75,8 +75,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 78b4bad10081..c8388f9206b2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -743,6 +743,24 @@ bool sched_can_stop_tick(void) return true; } + +/* Enable debugging of any interrupts of cpu_isolated cores. */ +static int cpu_isolated_debug; +static int __init cpu_isolated_debug_func(char *str) +{ + cpu_isolated_debug = true; + return 1; +} +__setup("cpu_isolated_debug", cpu_isolated_debug_func); + +void tick_nohz_cpu_isolated_debug(int cpu) +{ + if (cpu_isolated_debug && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) { + pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu); + dump_stack(); + } +} #endif /* CONFIG_NO_HZ_FULL */ void sched_avg_update(struct rq *rq) diff --git a/kernel/signal.c b/kernel/signal.c index 836df8dac6cc..90ee460c2586 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_NO_HZ_FULL + /* If the task is being killed, don't complain about cpu_isolated. 
*/ + if (state & TASK_WAKEKILL) + t->cpu_isolated_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..6b7d8e2c8af4 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/tick.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + tick_nohz_cpu_isolated_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + tick_nohz_cpu_isolated_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..333872925ff6 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,6 +24,7 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> @@ -335,6 +336,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + tick_nohz_cpu_isolated_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
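Putting the pieces of this patch together, a debugging setup might combine the existing nohz_full= and isolcpus= parameters with the new flag on the kernel command line. This is a hypothetical example (kernel image path and CPU list invented for illustration):

```
# GRUB kernel command line: isolate cores 1-3 and ask the kernel to
# dump a backtrace whenever it is about to interrupt a task that has
# requested PR_CPU_ISOLATED_ENABLE on one of them.
linux /boot/vmlinuz-4.2.0 root=/dev/sda1 ro \
      nohz_full=1-3 isolcpus=1-3 cpu_isolated_debug
```

The backtraces land on the console, so pairing this with a serial console or netconsole is advisable on production-like systems.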
* [PATCH v4 5/5] nohz: cpu_isolated: allow tick to be fully disabled 2015-07-13 19:57 ` Chris Metcalf ` (4 preceding siblings ...) (?) @ 2015-07-13 19:58 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-13 19:58 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_CPU_ISOLATED) semantics place a higher priority on running completely tickless, so don't bound the time_delta for such processes. In addition, due to the way such processes quiesce by waiting for the timer tick to stop prior to returning to userspace, without this commit it won't be possible to use the cpu_isolated mode at all. Removing the 1-second cap was previously discussed (see link below) and Thomas Gleixner observed that vruntime, load balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be indefinitely deferred just meant that no one would ever fix the underlying bugs. However, it's at least true that the mode proposed in this patch can only be enabled on an isolcpus core by a process requesting cpu_isolated mode, which may limit how important it is to maintain scheduler data correctly, for example. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, this will create an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently, so it may also be worth it from that perspective. 
Finally, it's worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2008) and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc). So this semantics is very useful if we can convince ourselves that doing this is safe. Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/time/tick-sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c5eca9c99fad..8187b4b4c91c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -754,7 +754,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, #ifdef CONFIG_NO_HZ_FULL /* Limit the tick delta to the maximum scheduler deferment */ - if (!ts->inidle) + if (!ts->inidle && !tick_nohz_is_cpu_isolated()) delta = min(delta, scheduler_tick_max_deferment()); #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v5 0/6] support "cpu_isolated" mode for nohz_full @ 2015-07-28 19:49 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This version of the patch series incorporates Christoph Lameter's change to add a quiet_vmstat() call, and restructures cpu_isolated as a "hard" isolation mode in contrast to nohz_full's "soft" isolation, breaking it out as a separate CONFIG_CPU_ISOLATED with its own include/linux/cpu_isolated.h and kernel/time/cpu_isolated.c. It is rebased to 4.2-rc3. Thomas: as I mentioned in v4, I haven't heard from you whether my removal of the cpu_idle calls sufficiently addresses your concerns about that aspect. Andy: as I said in email, I've left in the support where cpu_isolated relies on the context_tracking stuff currently in 4.2-rc3. I'm not sure what the cleanest way is for me to pick up the new context_tracking stuff; if that's all that ends up standing between this patch series and having it be pulled, perhaps I can rebase it onto whatever branch it is that has the new context_tracking? Original patch series cover letter follows: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. 
These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_CPU_ISOLATED to take advantage of this new mode. A prctl() option (PR_SET_CPU_ISOLATED) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on cpu_isolated cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. 
The series (based currently on v4.2-rc3) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane v5: rebased on kernel v4.2-rc3 converted to use CONFIG_CPU_ISOLATED and separate .c and .h files incorporates Christoph Lameter's quiet_vmstat() call v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also since if we remove the 1Hz tick, cpu_isolated threads will never re-enter userspace since a tick will always be pending. 
Chris Metcalf (5): cpu_isolated: add initial support cpu_isolated: support PR_CPU_ISOLATED_STRICT mode cpu_isolated: provide strict mode configurable signal cpu_isolated: add debug boot flag nohz: cpu_isolated: allow tick to be fully disabled Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 +++ arch/arm64/kernel/ptrace.c | 5 ++ arch/tile/kernel/process.c | 9 +++ arch/tile/kernel/ptrace.c | 5 +- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 +++- include/linux/cpu_isolated.h | 42 +++++++++++++ include/linux/sched.h | 3 + include/linux/vmstat.h | 2 + include/uapi/linux/prctl.h | 8 +++ kernel/context_tracking.c | 12 +++- kernel/irq_work.c | 5 +- kernel/sched/core.c | 21 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 +++ kernel/sys.c | 8 +++ kernel/time/Kconfig | 20 +++++++ kernel/time/Makefile | 1 + kernel/time/cpu_isolated.c | 116 ++++++++++++++++++++++++++++++++++++ kernel/time/tick-sched.c | 3 +- mm/vmstat.c | 14 +++++ 23 files changed, 305 insertions(+), 10 deletions(-) create mode 100644 include/linux/cpu_isolated.h create mode 100644 kernel/time/cpu_isolated.c -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v5 1/6] vmstat: provide a function to quiet down the diff processing 2015-07-28 19:49 ` Chris Metcalf (?) @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel From: Christoph Lameter <cl@linux.com> quiet_vmstat() can be called in anticipation of a OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing. Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes. Signed-off-by: Christoph Lameter <cl@linux.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 14 ++++++++++++++ 2 files changed, 16 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 82e7db7f7100..c013b8d8e434 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page, static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } +static inline void quiet_vmstat(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 4f5cd974e11a..cf7d324f16e2 100644 
--- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w) } /* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ +void quiet_vmstat(void) +{ + do { + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); + + } while (refresh_cpu_vm_stats()); +} + +/* * Check if the diffs for a certain cpu indicate that * an update is needed. */ -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v5 2/6] cpu_isolated: add initial support 2015-07-28 19:49 ` Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new CPU_ISOLATED Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "cpu_isolated" state is then indicated by setting a new task struct field, cpu_isolated_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new cpu_isolated_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only three actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt. 
Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++ include/linux/cpu_isolated.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 ++++ kernel/context_tracking.c | 3 ++ kernel/sys.c | 8 +++++ kernel/time/Kconfig | 20 +++++++++++++ kernel/time/Makefile | 1 + kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 144 insertions(+) create mode 100644 include/linux/cpu_isolated.h create mode 100644 kernel/time/cpu_isolated.c diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..7db6f8386417 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_CPU_ISOLATED +void cpu_isolated_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h new file mode 100644 index 000000000000..a3d17360f7ae --- /dev/null +++ b/include/linux/cpu_isolated.h @@ -0,0 +1,24 @@ +/* + * CPU isolation related global functions + */ +#ifndef _LINUX_CPU_ISOLATED_H +#define _LINUX_CPU_ISOLATED_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_CPU_ISOLATED +static inline bool is_cpu_isolated(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); +} + +extern void cpu_isolated_enter(void); +extern void cpu_isolated_wait(void); +#else +static inline bool is_cpu_isolated(void) { return false; } +static inline void 
cpu_isolated_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 04b5ada460b4..0bb248385d88 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1776,6 +1776,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_CPU_ISOLATED + unsigned int cpu_isolated_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..edb40b6b84db 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */ +#define PR_SET_CPU_ISOLATED 47 +#define PR_GET_CPU_ISOLATED 48 +# define PR_CPU_ISOLATED_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..36b6509c3e2a 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/cpu_isolated.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. 
*/ if (state == CONTEXT_USER) { + if (is_cpu_isolated()) + cpu_isolated_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..c68417ff4800 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_CPU_ISOLATED + case PR_SET_CPU_ISOLATED: + me->cpu_isolated_flags = arg2; + break; + case PR_GET_CPU_ISOLATED: + error = me->cpu_isolated_flags; + break; +#endif default: error = -EINVAL; break; diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index 579ce1b929af..141969149994 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -195,5 +195,25 @@ config HIGH_RES_TIMERS hardware is not capable then this option only increases the size of the kernel image. +config CPU_ISOLATED + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate" + themselves from the kernel. On return to userspace, + cpu-isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks running hard real-time code, + such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a good-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you intend to run a + high-performance userspace driver or similar task. 
+ endmenu endif diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 49eca0beed32..984081cce974 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o obj-$(CONFIG_TIMER_STATS) += timer_stats.o obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o obj-$(CONFIG_TEST_UDELAY) += test_udelay.o +obj-$(CONFIG_CPU_ISOLATED) += cpu_isolated.o diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c new file mode 100644 index 000000000000..e27259f30caf --- /dev/null +++ b/kernel/time/cpu_isolated.c @@ -0,0 +1,71 @@ +/* + * linux/kernel/time/cpu_isolated.c + * + * Implementation for cpu isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/cpu_isolated.h> +#include "tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. + */ +void __weak cpu_isolated_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In cpu_isolated mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two cpu_isolated processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void cpu_isolated_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. 
*/ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + cpu_isolated_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support @ 2015-08-12 16:00 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-08-12 16:00 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote: > The existing nohz_full mode is designed as a "soft" isolation mode > that makes tradeoffs to minimize userspace interruptions while > still attempting to avoid overheads in the kernel entry/exit path, > to provide 100% kernel semantics, etc. > > However, some applications require a "hard" commitment from the > kernel to avoid interruptions, in particular userspace device > driver style applications, such as high-speed networking code. > > This change introduces a framework to allow applications > to elect to have the "hard" semantics as needed, specifying > prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. > Subsequent commits will add additional flags and additional > semantics. We are doing this at the process level but the isolation works on the CPU scope... Now I wonder if prctl is the right interface. That said the user is rather interested in isolating a task. The CPU being the backend eventually. For example if the task is migrated by accident, we want it to be warned about that. And if the isolation is done on the CPU level instead of the task level, this won't happen. I'm also afraid that the naming clashes with cpu_isolated_map, although it could be a subset of it. So probably in this case we should consider talking about task rather than CPU isolation and change naming accordingly (sorry, I know I suggested cpu_isolation.c, I guess I had to see the result to realize). We must sort that out first. 
Either we consider isolation on the task level (and thus the underlying CPU by backend effect) and we use prctl(). Or we do this on the CPU level and we use a specific syscall or sysfs which takes effect on any task in the relevant isolated CPUs. What do you think? It would be nice to hear others opinions as well. > The kernel must be built with the new CPU_ISOLATED Kconfig flag > to enable this mode, and the kernel booted with an appropriate > nohz_full=CPULIST boot argument. The "cpu_isolated" state is then > indicated by setting a new task struct field, cpu_isolated_flags, > to the value passed by prctl(). When the _ENABLE bit is set for a > task, and it is returning to userspace on a nohz_full core, it calls > the new cpu_isolated_enter() routine to take additional actions > to help the task avoid being interrupted in the future. > > Initially, there are only three actions taken. First, the > task calls lru_add_drain() to prevent being interrupted by a > subsequent lru_add_drain_all() call on another core. Then, it calls > quiet_vmstat() to quieten the vmstat worker to avoid a follow-on > interrupt. Finally, the code checks for pending timer interrupts > and quiesces until they are no longer pending. As a result, sys > calls (and page faults, etc.) can be inordinately slow. However, > this quiescing guarantees that no unexpected interrupts will occur, > even if the application intentionally calls into the kernel. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/tile/kernel/process.c | 9 ++++++ > include/linux/cpu_isolated.h | 24 +++++++++++++++ > include/linux/sched.h | 3 ++ > include/uapi/linux/prctl.h | 5 ++++ > kernel/context_tracking.c | 3 ++ > kernel/sys.c | 8 +++++ > kernel/time/Kconfig | 20 +++++++++++++ > kernel/time/Makefile | 1 + > kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ It's not about time :-) The timer is only a part of the isolation. Moreover "isolatED" is a state. 
The filename should reflect the process. "isolatION" would better fit. kernel/task_isolation.c maybe or just kernel/isolation.c I think I prefer the latter because I'm not only interested in that task hard isolation feature, I would like to also drive all the general isolation operations from there (workqueue affinity, rcu nocb, ...). > 9 files changed, 144 insertions(+) > create mode 100644 include/linux/cpu_isolated.h > create mode 100644 kernel/time/cpu_isolated.c > > diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c > index e036c0aa9792..7db6f8386417 100644 > --- a/arch/tile/kernel/process.c > +++ b/arch/tile/kernel/process.c > @@ -70,6 +70,15 @@ void arch_cpu_idle(void) > _cpu_idle(); > } > > +#ifdef CONFIG_CPU_ISOLATED > +void cpu_isolated_wait(void) > +{ > + set_current_state(TASK_INTERRUPTIBLE); > + _cpu_idle(); > + set_current_state(TASK_RUNNING); > +} I'm still uncomfortable with that. A wake up model could work? > +#endif > + > /* > * Release a thread_info structure > */ > diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h > new file mode 100644 > index 000000000000..a3d17360f7ae > --- /dev/null > +++ b/include/linux/cpu_isolated.h > @@ -0,0 +1,24 @@ > +/* > + * CPU isolation related global functions > + */ > +#ifndef _LINUX_CPU_ISOLATED_H > +#define _LINUX_CPU_ISOLATED_H > + > +#include <linux/tick.h> > +#include <linux/prctl.h> > + > +#ifdef CONFIG_CPU_ISOLATED > +static inline bool is_cpu_isolated(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE); > +} > + > +extern void cpu_isolated_enter(void); > +extern void cpu_isolated_wait(void); > +#else > +static inline bool is_cpu_isolated(void) { return false; } > +static inline void cpu_isolated_enter(void) { } > +#endif And all the naming should be about task as well, if we take that task direction. 
> + > +#endif > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 04b5ada460b4..0bb248385d88 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1776,6 +1776,9 @@ struct task_struct { > unsigned long task_state_change; > #endif > int pagefault_disabled; > +#ifdef CONFIG_CPU_ISOLATED > + unsigned int cpu_isolated_flags; > +#endif Can't we add a new flag to tsk->flags? There seem to be some values remaining. Thanks. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support 2015-08-12 16:00 ` Frederic Weisbecker @ 2015-08-12 18:22 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-12 18:22 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: > On Tue, Jul 28, 2015 at 03:49:36PM -0400, Chris Metcalf wrote: >> The existing nohz_full mode is designed as a "soft" isolation mode >> that makes tradeoffs to minimize userspace interruptions while >> still attempting to avoid overheads in the kernel entry/exit path, >> to provide 100% kernel semantics, etc. >> >> However, some applications require a "hard" commitment from the >> kernel to avoid interruptions, in particular userspace device >> driver style applications, such as high-speed networking code. >> >> This change introduces a framework to allow applications >> to elect to have the "hard" semantics as needed, specifying >> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so. >> Subsequent commits will add additional flags and additional >> semantics. > We are doing this at the process level but the isolation works on > the CPU scope... Now I wonder if prctl is the right interface. > > That said the user is rather interested in isolating a task. The CPU > being the backend eventually. > > For example if the task is migrated by accident, we want it to be > warned about that. And if the isolation is done on the CPU level > instead of the task level, this won't happen. > > I'm also afraid that the naming clashes with cpu_isolated_map, > although it could be a subset of it. 
> > So probably in this case we should consider talking about task rather > than CPU isolation and change naming accordingly (sorry, I know I > suggested cpu_isolation.c, I guess I had to see the result to realize). > > We must sort that out first. Either we consider isolation on the task > level (and thus the underlying CPU by backend effect) and we use prctl(). > Or we do this on the CPU level and we use a specific syscall or sysfs > which takes effect on any task in the relevant isolated CPUs. > > What do you think? Yes, definitely task-centric is the right model. With the original tilegx version of this code, we also checked that the process had only a single core in its affinity mask, and that the single core in question was a nohz_full core, before allowing the "task isolated" mode to take effect. I didn't do that in this round of patches because it seemed a little silly in that the user could then immediately reset their affinity to another core and lose the effect, and it wasn't clear how to handle that: do we return EINVAL from sched_setaffinity() after enabling the "task isolated" mode? That seems potentially ugly, maybe standards-violating, etc. So I didn't bother. But you could certainly argue for failing prctl() in that case anyway, as a way to make sure users aren't doing something stupid like calling the prctl() from a task that's running on a housekeeping core. And you could even argue for doing some kind of console spew if you try to migrate a task that is in "task isolation" state - though I suppose if you migrate it to another isolcpus and nohz_full core, maybe that's kind of reasonable and doesn't deserve a warning? I'm not sure. >> The kernel must be built with the new CPU_ISOLATED Kconfig flag >> to enable this mode, and the kernel booted with an appropriate >> nohz_full=CPULIST boot argument. The "cpu_isolated" state is then >> indicated by setting a new task struct field, cpu_isolated_flags, >> to the value passed by prctl(). 
When the _ENABLE bit is set for a >> task, and it is returning to userspace on a nohz_full core, it calls >> the new cpu_isolated_enter() routine to take additional actions >> to help the task avoid being interrupted in the future. >> >> Initially, there are only three actions taken. First, the >> task calls lru_add_drain() to prevent being interrupted by a >> subsequent lru_add_drain_all() call on another core. Then, it calls >> quiet_vmstat() to quieten the vmstat worker to avoid a follow-on >> interrupt. Finally, the code checks for pending timer interrupts >> and quiesces until they are no longer pending. As a result, sys >> calls (and page faults, etc.) can be inordinately slow. However, >> this quiescing guarantees that no unexpected interrupts will occur, >> even if the application intentionally calls into the kernel. >> >> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> >> --- >> arch/tile/kernel/process.c | 9 ++++++ >> include/linux/cpu_isolated.h | 24 +++++++++++++++ >> include/linux/sched.h | 3 ++ >> include/uapi/linux/prctl.h | 5 ++++ >> kernel/context_tracking.c | 3 ++ >> kernel/sys.c | 8 +++++ >> kernel/time/Kconfig | 20 +++++++++++++ >> kernel/time/Makefile | 1 + >> kernel/time/cpu_isolated.c | 71 ++++++++++++++++++++++++++++++++++++++++++++ > It's not about time :-) > > The timer is only a part of the isolation. > > Moreover "isolatED" is a state. The filename should reflect the process. "isolatION" would > better fit. > > kernel/task_isolation.c maybe or just kernel/isolation.c > > I think I prefer the latter because I'm not only interested in that task > hard isolation feature, I would like to also drive all the general isolation > operations from there (workqueue affinity, rcu nocb, ...). That's reasonable, but I think the "task isolation" naming is probably better for all the stuff that we're doing in this patch. 
In other words, we probably should use "task_isolation" as the prefix for symbol names and API names, even if we put the code in kernel/isolation.c for now in anticipation of non-task isolation being added later. I think my instinct would still be to call it kernel/task_isolation.c until we actually add some non-task isolation, and at that point we can decide if it makes sense to rename the file, or put the new code somewhere else, but I'm OK with doing it the way I described in the previous paragraph if you think it's better. >> +#ifdef CONFIG_CPU_ISOLATED >> +void cpu_isolated_wait(void) >> +{ >> + set_current_state(TASK_INTERRUPTIBLE); >> + _cpu_idle(); >> + set_current_state(TASK_RUNNING); >> +} > I'm still uncomfortable with that. A wake up model could work? I don't know exactly what you have in mind. The theory is that at this point we're ready to return to user space and we're just waiting for a timer tick that is guaranteed to arrive, since there is something pending for the timer. And, this is an arch-specific method anyway; the generic method is actually checking to see if a signal has been delivered, scheduling is needed, etc., each time around the loop, so if you're not sure your architecture will do the right thing, just don't provide a method that idles while waiting. For tilegx I'm sure it works correctly, so I'm OK providing that method. >> +extern void cpu_isolated_enter(void); >> +extern void cpu_isolated_wait(void); >> +#else >> +static inline bool is_cpu_isolated(void) { return false; } >> +static inline void cpu_isolated_enter(void) { } >> +#endif > And all the naming should be about task as well, if we take that task direction. As discussed above, probably task_isolation_enter(), etc. 
>> + >> +#endif >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 04b5ada460b4..0bb248385d88 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1776,6 +1776,9 @@ struct task_struct { >> unsigned long task_state_change; >> #endif >> int pagefault_disabled; >> +#ifdef CONFIG_CPU_ISOLATED >> + unsigned int cpu_isolated_flags; >> +#endif > Can't we add a new flag to tsk->flags? There seem to be some values remaining. Yeah, I thought of that, but it seems like a pretty scarce resource, and I wasn't sure it was the right thing to do. Also, I'm not actually sure why the lowest two bits aren't apparently being used; looks like PF_EXITING (0x4) is the first bit used. And there are only three more bits higher up in the word that are not assigned. Also, right now we are allowing users to customize the signal delivered for STRICT violation, and that signal value is stored in the cpu_isolated_flags word as well, so we really don't have room in tsk->flags for all of that anyway. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support @ 2015-08-26 15:26 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-08-26 15:26 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote: > On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: > >>+#ifdef CONFIG_CPU_ISOLATED > >>+void cpu_isolated_wait(void) > >>+{ > >>+ set_current_state(TASK_INTERRUPTIBLE); > >>+ _cpu_idle(); > >>+ set_current_state(TASK_RUNNING); > >>+} > >I'm still uncomfortable with that. A wake up model could work? > > I don't know exactly what you have in mind. The theory is that > at this point we're ready to return to user space and we're just > waiting for a timer tick that is guaranteed to arrive, since there > is something pending for the timer. Hmm, ok I'm going to discuss that in the new version. One worry is that it gets racy and we sleep there for ever. > > And, this is an arch-specific method anyway; the generic method > is actually checking to see if a signal has been delivered, > scheduling is needed, etc., each time around the loop, so if > you're not sure your architecture will do the right thing, just > don't provide a method that idles while waiting. For tilegx I'm > sure it works correctly, so I'm OK providing that method. Yes but we do busy waiting on all other archs then. And since we can wait for a while there, it doesn't look sane. 
> >>diff --git a/include/linux/sched.h b/include/linux/sched.h > >>index 04b5ada460b4..0bb248385d88 100644 > >>--- a/include/linux/sched.h > >>+++ b/include/linux/sched.h > >>@@ -1776,6 +1776,9 @@ struct task_struct { > >> unsigned long task_state_change; > >> #endif > >> int pagefault_disabled; > >>+#ifdef CONFIG_CPU_ISOLATED > >>+ unsigned int cpu_isolated_flags; > >>+#endif > >Can't we add a new flag to tsk->flags? There seem to be some values remaining. > > Yeah, I thought of that, but it seems like a pretty scarce resource, > and I wasn't sure it was the right thing to do. Also, I'm not actually > sure why the lowest two bits aren't apparently being used Probably they were used but got removed. > looks > like PF_EXITING (0x4) is the first bit used. And there are only three > more bits higher up in the word that are not assigned. Which makes room for 5 :) > > Also, right now we are allowing users to customize the signal delivered > for STRICT violation, and that signal value is stored in the > cpu_isolated_flags word as well, so we really don't have room in > tsk->flags for all of that anyway. Yeah indeed, ok lets keep it that way for now. Thanks. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v5 2/6] cpu_isolated: add initial support 2015-08-26 15:26 ` Frederic Weisbecker @ 2015-08-26 15:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-26 15:55 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel On 08/26/2015 11:26 AM, Frederic Weisbecker wrote: > On Wed, Aug 12, 2015 at 02:22:09PM -0400, Chris Metcalf wrote: >> On 08/12/2015 12:00 PM, Frederic Weisbecker wrote: >>>> +#ifdef CONFIG_CPU_ISOLATED >>>> +void cpu_isolated_wait(void) >>>> +{ >>>> + set_current_state(TASK_INTERRUPTIBLE); >>>> + _cpu_idle(); >>>> + set_current_state(TASK_RUNNING); >>>> +} >>> I'm still uncomfortable with that. A wake up model could work? >> I don't know exactly what you have in mind. The theory is that >> at this point we're ready to return to user space and we're just >> waiting for a timer tick that is guaranteed to arrive, since there >> is something pending for the timer. > Hmm, ok I'm going to discuss that in the new version. One worry is that > it gets racy and we sleep there for ever. > >> And, this is an arch-specific method anyway; the generic method >> is actually checking to see if a signal has been delivered, >> scheduling is needed, etc., each time around the loop, so if >> you're not sure your architecture will do the right thing, just >> don't provide a method that idles while waiting. For tilegx I'm >> sure it works correctly, so I'm OK providing that method. > Yes but we do busy waiting on all other archs then. And since we can wait > for a while there, it doesn't look sane. We can wait for a while (potentially multiple ticks), which is certainly a long time, but that's what the user asked for. 
Since we're checking signals and scheduling in the busy loop, we definitely won't get into some nasty unkillable state, which would be the real worst-case. I think the question is, could a process just get stuck there somehow in the normal course of events, where there is a future event on the tick_cpu_device, but no interrupt is enabled that will eventually deal with it? This seems like it would be a pretty fundamental timekeeping bug, so my assumption here is that can't happen, but maybe...? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode 2015-07-28 19:49 ` Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Note: Andy Lutomirski points out that improvements are coming to the context_tracking code to make it more robust, which may mean that some of the code suggested here for context_tracking may not be necessary. I am keeping it in the series for now since it is required for it to work based on 4.2-rc3. 
arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/cpu_isolated.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/cpu_isolated.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..ff83968ab4d4 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/cpu_isolated.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report cpu_isolated violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict()) + cpu_isolated_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..e54256c54311 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
* [PATCH v5 3/6] cpu_isolated: support PR_CPU_ISOLATED_STRICT mode @ 2015-07-28 19:49 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With cpu_isolated mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Note: Andy Lutomirski points out that improvements are coming to the context_tracking code to make it more robust, which may mean that some of the code suggested here for context_tracking may not be necessary. I am keeping it in the series for now since it is required for it to work based on 4.2-rc3. 
arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/cpu_isolated.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/time/cpu_isolated.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..ff83968ab4d4 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/cpu_isolated.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report cpu_isolated violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && cpu_isolated_strict()) + cpu_isolated_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..e54256c54311 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (cpu_isolated_strict()) + cpu_isolated_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..e5aec57e8e25 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (cpu_isolated_strict()) + cpu_isolated_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..590414ef2bf1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/cpu_isolated.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (cpu_isolated_strict()) + cpu_isolated_exception(); + } + } return prev_ctx; } diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h index a3d17360f7ae..b0f1c2669b2f 100644 --- a/include/linux/cpu_isolated.h +++ b/include/linux/cpu_isolated.h @@ -15,10 +15,26 @@ static inline bool is_cpu_isolated(void) } extern void cpu_isolated_enter(void); +extern void 
cpu_isolated_syscall(int nr); +extern void cpu_isolated_exception(void); extern void cpu_isolated_wait(void); #else static inline bool is_cpu_isolated(void) { return false; } static inline void cpu_isolated_enter(void) { } +static inline void cpu_isolated_syscall(int nr) { } +static inline void cpu_isolated_exception(void) { } #endif +static inline bool cpu_isolated_strict(void) +{ +#ifdef CONFIG_CPU_ISOLATED + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->cpu_isolated_flags & + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) == + (PR_CPU_ISOLATED_ENABLE | PR_CPU_ISOLATED_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index edb40b6b84db..0c11238a84fb 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_CPU_ISOLATED 47 #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) +# define PR_CPU_ISOLATED_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 36b6509c3e2a..c740850eea11 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c index e27259f30caf..d30bf3852897 100644 --- a/kernel/time/cpu_isolated.c +++ b/kernel/time/cpu_isolated.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/cpu_isolated.h> +#include <asm/unistd.h> #include "tick-sched.h" /* @@ -69,3 +70,40 @@ void cpu_isolated_enter(void) dump_stack(); } } + +static void kill_cpu_isolated_strict_task(void) +{ + dump_stack(); + current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void cpu_isolated_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_cpu_isolated_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void cpu_isolated_exception(void) +{ + pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", + current->comm, current->pid); + kill_cpu_isolated_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v5 4/6] cpu_isolated: provide strict mode configurable signal 2015-07-28 19:49 ` Chris Metcalf @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a cpu_isolated process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/time/cpu_isolated.c | 17 ++++++++++++----- 2 files changed, 14 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 0c11238a84fb..ab45bd3d5799 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_CPU_ISOLATED 48 # define PR_CPU_ISOLATED_ENABLE (1 << 0) # define PR_CPU_ISOLATED_STRICT (1 << 1) +# define PR_CPU_ISOLATED_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_CPU_ISOLATED_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c index d30bf3852897..9f8fcbd97770 100644 --- a/kernel/time/cpu_isolated.c +++ b/kernel/time/cpu_isolated.c @@ -71,11 +71,18 @@ void cpu_isolated_enter(void) } } -static void kill_cpu_isolated_strict_task(void) -{ +static void kill_cpu_isolated_strict_task(int is_syscall) + { + siginfo_t info = {}; + int sig; + dump_stack(); current->cpu_isolated_flags &= ~PR_CPU_ISOLATED_ENABLE; - send_sig(SIGKILL, current, 1); + + 
sig = PR_CPU_ISOLATED_GET_SIG(current->cpu_isolated_flags) ?: SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -94,7 +101,7 @@ void cpu_isolated_syscall(int syscall) pr_warn("%s/%d: cpu_isolated strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(1); } /* @@ -105,5 +112,5 @@ void cpu_isolated_exception(void) { pr_warn("%s/%d: cpu_isolated strict mode violated by exception\n", current->comm, current->pid); - kill_cpu_isolated_strict_task(); + kill_cpu_isolated_strict_task(0); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v5 5/6] cpu_isolated: add debug boot flag 2015-07-28 19:49 ` Chris Metcalf ` (4 preceding siblings ...) @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel Cc: Chris Metcalf The new "cpu_isolated_debug" boot flag simplifies debugging of CPU_ISOLATED kernels when processes are running in PR_CPU_ISOLATED_ENABLE mode. Such processes should get no interrupts from the kernel; if they do, and this boot flag is specified, a kernel stack dump is generated on the console. It's possible to use ftrace simply to detect that a cpu_isolated core has unexpectedly entered the kernel, but this boot flag lets the kernel provide better diagnostics, e.g. by reporting in the IPI-generating code which remote core and context is preparing to deliver an interrupt to a cpu_isolated core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 7 +++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/cpu_isolated.h | 2 ++ kernel/irq_work.c | 5 ++++- kernel/sched/core.c | 21 +++++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 7 +++++++ 8 files changed, 54 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1d6f0459cd7b..940e4c9f1978 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -749,6 +749,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. /proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt. + cpu_isolated_debug [KNL] + In kernels built with CONFIG_CPU_ISOLATED and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_CPU_ISOLATED_ENABLE + and is running on a nohz_full core. + cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..fdef5e3d6396 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/cpu_isolated.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + cpu_isolated_debug(cpu); + } } /* diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h index b0f1c2669b2f..4ea67d640be7 100644 --- a/include/linux/cpu_isolated.h +++ b/include/linux/cpu_isolated.h @@ -18,11 +18,13 @@ extern void cpu_isolated_enter(void); extern void cpu_isolated_syscall(int nr); extern void cpu_isolated_exception(void); extern void cpu_isolated_wait(void); +extern void cpu_isolated_debug(int cpu); #else static inline bool is_cpu_isolated(void) { return false; } static inline void cpu_isolated_enter(void) { } static inline void cpu_isolated_syscall(int nr) { } static inline void cpu_isolated_exception(void) { } +static inline void cpu_isolated_debug(int cpu) { } #endif static inline bool cpu_isolated_strict(void) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..3c08a41f9898 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -17,6 +17,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/smp.h> +#include 
<linux/cpu_isolated.h> #include <asm/processor.h> @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + cpu_isolated_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 78b4bad10081..647671900497 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -74,6 +74,7 @@ #include <linux/binfmts.h> #include <linux/context_tracking.h> #include <linux/compiler.h> +#include <linux/cpu_isolated.h> #include <asm/switch_to.h> #include <asm/tlb.h> @@ -745,6 +746,26 @@ bool sched_can_stop_tick(void) } #endif /* CONFIG_NO_HZ_FULL */ +#ifdef CONFIG_CPU_ISOLATED +/* Enable debugging of any interrupts of cpu_isolated cores. */ +static int cpu_isolated_debug_flag; +static int __init cpu_isolated_debug_func(char *str) +{ + cpu_isolated_debug_flag = true; + return 1; +} +__setup("cpu_isolated_debug", cpu_isolated_debug_func); + +void cpu_isolated_debug(int cpu) +{ + if (cpu_isolated_debug_flag && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE)) { + pr_err("Interrupt detected for cpu_isolated cpu %d\n", cpu); + dump_stack(); + } +} +#endif + void sched_avg_update(struct rq *rq) { s64 period = sched_avg_period(); diff --git a/kernel/signal.c b/kernel/signal.c index 836df8dac6cc..90ee460c2586 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_NO_HZ_FULL + /* If the task is being killed, don't complain about cpu_isolated. 
*/ + if (state & TASK_WAKEKILL) + t->cpu_isolated_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..846e42a3daa3 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/cpu_isolated.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + cpu_isolated_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + cpu_isolated_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..456149a4a34f 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,8 +24,10 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> +#include <linux/cpu_isolated.h> #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -335,6 +337,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + cpu_isolated_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v5 6/6] nohz: cpu_isolated: allow tick to be fully disabled 2015-07-28 19:49 ` Chris Metcalf ` (5 preceding siblings ...) @ 2015-07-28 19:49 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-07-28 19:49 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel Cc: Chris Metcalf While the current fallback to a 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_CPU_ISOLATED) place a higher priority on running completely tickless, so don't bound the time_delta for such processes. In addition, because such processes quiesce by waiting for the timer tick to stop prior to returning to userspace, without this commit it won't be possible to use cpu_isolated mode at all. Removing the 1-second cap was previously discussed (see link below), and Thomas Gleixner observed that vruntime, load-balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be deferred indefinitely just means that no one will ever fix the underlying bugs. However, it's at least true that the mode proposed in this patch can only be enabled on a nohz_full core by a process requesting cpu_isolated mode, which may limit how important it is to maintain scheduler data correctly there, for example. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, we will provide an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently, so it may also be worth it from that perspective.
Finally, it's worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2008), and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc.). So these semantics are very useful if we can convince ourselves that doing this is safe. Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/time/tick-sched.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c792429e98c6..3a1d48418499 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/cpu_isolated.h> #include <asm/irq_regs.h> @@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, #ifdef CONFIG_NO_HZ_FULL /* Limit the tick delta to the maximum scheduler deferment */ - if (!ts->inidle) + if (!ts->inidle && !is_cpu_isolated()) delta = min(delta, scheduler_tick_max_deferment()); #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 0/6] support "task_isolated" mode for nohz_full
2015-07-28 19:49 ` Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf
-1 siblings, 0 replies; 340+ messages in thread
From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw)
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel
Cc: Chris Metcalf

The cover email for the patch series is getting a little unwieldy, so I will provide a terser summary here and just update the list of changes from version to version. Please see the previous versions linked by the In-Reply-To for more detailed comments about changes in earlier versions of the patch series.

v6: restructured to be a "task_isolation" mode, not a "cpu_isolated" mode (Frederic)

v5: rebased on kernel v4.2-rc3; converted to use CONFIG_CPU_ISOLATED and separate .c and .h files; incorporates Christoph Lameter's quiet_vmstat() call

v4: rebased on kernel v4.2-rc1; added support for detecting CPU_ISOLATED_STRICT syscalls on arm64

v3: remove dependency on cpu_idle subsystem (Thomas Gleixner); use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter; use seconds for console messages instead of jiffies (Thomas Gleixner); updated commit description for patch 5/5

v2: rename "dataplane" to "cpu_isolated"; drop ksoftirqd suppression changes (believed no longer needed); merge previous "QUIESCE" functionality into baseline functionality; explicitly track syscalls and exceptions for "STRICT" functionality; allow configuring a signal to be delivered for STRICT mode failures; move debug tracking to irq_enter(), not irq_exit()

General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it.
However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on task_isolation cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. 
The series (based currently on v4.2-rc3) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); note also that if we remove the 1Hz tick, task_isolation threads will never re-enter userspace, since a tick will always be pending. Chris Metcalf (5): task_isolation: add initial support task_isolation: support PR_TASK_ISOLATION_STRICT mode task_isolation: provide strict mode configurable signal task_isolation: add debug boot flag nohz: task_isolation: allow tick to be fully disabled Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 +++ arch/arm64/kernel/ptrace.c | 5 ++ arch/tile/kernel/process.c | 9 +++ arch/tile/kernel/ptrace.c | 5 +- arch/tile/mm/homecache.c | 5 +- arch/x86/kernel/ptrace.c | 2 + include/linux/context_tracking.h | 11 +++- include/linux/isolation.h | 42 +++++++++++++ include/linux/sched.h | 3 + include/linux/vmstat.h | 2 + include/uapi/linux/prctl.h | 8 +++ init/Kconfig | 20 ++++++ kernel/Makefile | 1 + kernel/context_tracking.c | 12 +++- kernel/irq_work.c | 5 +- kernel/isolation.c | 122 ++++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 21 +++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 +++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 3 +- mm/vmstat.c | 14 +++++ 23 files changed, 311 insertions(+), 10 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v6 0/6] support "task_isolation" mode for nohz_full @ 2015-08-25 19:55 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The cover email for the patch series is getting a little unwieldy so I will provide a terser summary here, and just update the list of changes from version to version. Please see the previous versions linked by the In-Reply-To for more detailed comments about changes in earlier versions of the patch series. v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic) v5: rebased on kernel v4.2-rc3 converted to use CONFIG_CPU_ISOLATED and separate .c and .h files incorporates Christoph Lameter's quiet_vmstat() call v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it.
* [PATCH v6 1/6] vmstat: provide a function to quiet down the diff processing 2015-08-25 19:55 ` Chris Metcalf (?) @ 2015-08-25 19:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel From: Christoph Lameter <cl@linux.com> quiet_vmstat() can be called in anticipation of an OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing. Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes. Signed-off-by: Christoph Lameter <cl@linux.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 14 ++++++++++++++ 2 files changed, 16 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 82e7db7f7100..c013b8d8e434 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page, static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } +static inline void quiet_vmstat(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 4f5cd974e11a..cf7d324f16e2 100644
--- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w) } /* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ +void quiet_vmstat(void) +{ + do { + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); + + } while (refresh_cpu_vm_stats()); +} + +/* * Check if the diffs for a certain cpu indicate that * an update is needed. */ -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 2/6] task_isolation: add initial support 2015-08-25 19:55 ` Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only three actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt.
Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 9 ++++++ include/linux/isolation.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 ++++ init/Kconfig | 20 +++++++++++++ kernel/Makefile | 1 + kernel/context_tracking.c | 3 ++ kernel/isolation.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 8 +++++ 9 files changed, 148 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index e036c0aa9792..1d9bd2320a50 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -70,6 +70,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_TASK_ISOLATION +void task_isolation_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..fd04011b1c1e --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,24 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); +} + +extern void task_isolation_enter(void); +extern void task_isolation_wait(void); +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline void 
task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 04b5ada460b4..2acb618189d0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1776,6 +1776,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..79da784fe17a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,9 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */ # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */ +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 47 +#define PR_GET_TASK_ISOLATION 48 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index af09b4fb43d2..82d313cbd70f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + code, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a good-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees.
+ + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 43c4c920f30a..9ffb5c021767 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o obj-$(CONFIG_TORTURE_TEST) += torture.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..c57c99f5c4d7 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -20,6 +20,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state) * on the tick. */ if (state == CONTEXT_USER) { + if (task_isolation_enabled()) + task_isolation_enter(); trace_user_enter(0); vtime_user_enter(current); } diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..d4618cd9e23d --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,75 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include "time/tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. + * + * Note that it must be guaranteed for a particular architecture + * that if next_event is not KTIME_MAX, then a timer interrupt will + * occur, otherwise the sleep may never awaken. 
+ */ +void __weak task_isolation_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In task_isolation mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two task_isolation processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. + */ +void task_isolation_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. 
*/ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + if (should_resched()) + schedule(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + task_isolation_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} diff --git a/kernel/sys.c b/kernel/sys.c index 259fda25eb6b..c7024be2d79b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + me->task_isolation_flags = arg2; + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-25 19:55 ` Chris Metcalf @ 2015-08-25 19:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..e3d83a12f3cf 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report task_isolation violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..2f9ce9466daf 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void 
task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..a89a6e9adfb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,40 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-08-25 19:55 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 5 +++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..e3d83a12f3cf 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + /* Ensure we report task_isolation violations in all circumstances. */ + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..2f9ce9466daf 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void 
task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..a89a6e9adfb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,40 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-25 19:55 ` Chris Metcalf (?) @ 2015-08-26 10:36 ` Will Deacon 2015-08-26 15:10 ` Chris Metcalf 2015-08-28 15:31 ` Chris Metcalf -1 siblings, 2 replies; 340+ messages in thread From: Will Deacon @ 2015-08-26 10:36 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc, linux-api, linux-kernel Hi Chris, On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: > With task_isolation mode, the task is in principle guaranteed not to > be interrupted by the kernel, but only if it behaves. In particular, > if it enters the kernel via system call, page fault, or any of a > number of other synchronous traps, it may be unexpectedly exposed > to long latencies. Add a simple flag that puts the process into > a state where any such kernel entry is fatal. > > To allow the state to be entered and exited, we ignore the prctl() > syscall so that we can clear the bit again later, and we ignore > exit/exit_group to allow exiting the task without a pointless signal > killing you as you try to do so. > > This change adds the syscall-detection hooks only for x86, arm64, > and tile. > > The signature of context_tracking_exit() changes to report whether > we, in fact, are exiting back to user space, so that we can track > user exceptions properly separately from other kernel entries. 
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/arm64/kernel/ptrace.c | 5 +++++ > arch/tile/kernel/ptrace.c | 5 ++++- > arch/x86/kernel/ptrace.c | 2 ++ > include/linux/context_tracking.h | 11 ++++++++--- > include/linux/isolation.h | 16 ++++++++++++++++ > include/uapi/linux/prctl.h | 1 + > kernel/context_tracking.c | 9 ++++++--- > kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ > 8 files changed, 80 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c > index d882b833dbdb..e3d83a12f3cf 100644 > --- a/arch/arm64/kernel/ptrace.c > +++ b/arch/arm64/kernel/ptrace.c > @@ -37,6 +37,7 @@ > #include <linux/regset.h> > #include <linux/tracehook.h> > #include <linux/elf.h> > +#include <linux/isolation.h> > > #include <asm/compat.h> > #include <asm/debug-monitors.h> > @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, > > asmlinkage int syscall_trace_enter(struct pt_regs *regs) > { > + /* Ensure we report task_isolation violations in all circumstances. */ > + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) This is going to force us to check TIF_NOHZ on the syscall slowpath even when CONFIG_TASK_ISOLATION=n. > + task_isolation_syscall(regs->syscallno); > + > /* Do the secure computing check first; failures should be fast. */ Here we have the usual priority problems with all the subsystems that hook into the syscall path. If a prctl is later rewritten to a different syscall, do you care about catching it? Either way, the comment about doing secure computing "first" needs fixing. Cheers, Will ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-26 10:36 ` Will Deacon @ 2015-08-26 15:10 ` Chris Metcalf 2015-09-02 10:13 ` Will Deacon 2015-08-28 15:31 ` Chris Metcalf 1 sibling, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-08-26 15:10 UTC (permalink / raw) To: Will Deacon Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc, linux-api, linux-kernel On 08/26/2015 06:36 AM, Will Deacon wrote: > Hi Chris, > > On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c >> index d882b833dbdb..e3d83a12f3cf 100644 >> --- a/arch/arm64/kernel/ptrace.c >> +++ b/arch/arm64/kernel/ptrace.c >> @@ -37,6 +37,7 @@ >> #include <linux/regset.h> >> #include <linux/tracehook.h> >> #include <linux/elf.h> >> +#include <linux/isolation.h> >> >> #include <asm/compat.h> >> #include <asm/debug-monitors.h> >> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, >> >> asmlinkage int syscall_trace_enter(struct pt_regs *regs) >> { >> + /* Ensure we report task_isolation violations in all circumstances. */ >> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) > This is going to force us to check TIF_NOHZ on the syscall slowpath even > when CONFIG_TASK_ISOLATION=n. Yes, good catch. I was thinking the "&& false" would suppress the TIF test but I forgot that test_bit() takes a volatile argument, so it gets evaluated even though the result isn't actually used. But I don't want to just reorder the two tests, because when isolation is enabled, testing TIF_NOHZ first is better. I think probably the right solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that test, even though that is a little crufty. 
The alternative is to provide a task_isolation_configured() macro that just returns true or false, and make it a three-part "&&" test with that new macro first, but that seems a little crufty as well. Do you have a preference? >> + task_isolation_syscall(regs->syscallno); >> + >> /* Do the secure computing check first; failures should be fast. */ > Here we have the usual priority problems with all the subsystems that > hook into the syscall path. If a prctl is later rewritten to a different > syscall, do you care about catching it? Either way, the comment about > doing secure computing "first" needs fixing. I admit I am unclear on the utility of rewriting prctl. My instinct is that we are trying to catch userspace invocations of prctl and allow them, and fail most everything else, so doing it pre-rewrite seems OK. I'm not sure if it makes sense to catch it before or after the secure computing check, though. On reflection maybe doing it afterwards makes more sense - what do you think? Thanks! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v6 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-09-02 10:13 ` Will Deacon 0 siblings, 0 replies; 340+ messages in thread From: Will Deacon @ 2015-09-02 10:13 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc, linux-api, linux-kernel On Wed, Aug 26, 2015 at 04:10:34PM +0100, Chris Metcalf wrote: > On 08/26/2015 06:36 AM, Will Deacon wrote: > > On Tue, Aug 25, 2015 at 08:55:52PM +0100, Chris Metcalf wrote: > >> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c > >> index d882b833dbdb..e3d83a12f3cf 100644 > >> --- a/arch/arm64/kernel/ptrace.c > >> +++ b/arch/arm64/kernel/ptrace.c > >> @@ -37,6 +37,7 @@ > >> #include <linux/regset.h> > >> #include <linux/tracehook.h> > >> #include <linux/elf.h> > >> +#include <linux/isolation.h> > >> > >> #include <asm/compat.h> > >> #include <asm/debug-monitors.h> > >> @@ -1150,6 +1151,10 @@ static void tracehook_report_syscall(struct pt_regs *regs, > >> > >> asmlinkage int syscall_trace_enter(struct pt_regs *regs) > >> { > >> + /* Ensure we report task_isolation violations in all circumstances. */ > >> + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) > > This is going to force us to check TIF_NOHZ on the syscall slowpath even > > when CONFIG_TASK_ISOLATION=n. > > Yes, good catch. I was thinking the "&& false" would suppress the TIF > test but I forgot that test_bit() takes a volatile argument, so it gets > evaluated even though the result isn't actually used. > > But I don't want to just reorder the two tests, because when isolation > is enabled, testing TIF_NOHZ first is better. I think probably the right > solution is just to put an #ifdef CONFIG_TASK_ISOLATION around that > test, even though that is a little crufty. 
The alternative is to provide > a task_isolation_configured() macro that just returns true or false, and > make it a three-part "&&" test with that new macro first, but > that seems a little crufty as well. Do you have a preference? Maybe use IS_ENABLED(CONFIG_TASK_ISOLATION) ? > >> + task_isolation_syscall(regs->syscallno); > >> + > >> /* Do the secure computing check first; failures should be fast. */ > > Here we have the usual priority problems with all the subsystems that > > hook into the syscall path. If a prctl is later rewritten to a different > > syscall, do you care about catching it? Either way, the comment about > > doing secure computing "first" needs fixing. > > I admit I am unclear on the utility of rewriting prctl. My instinct is that > we are trying to catch userspace invocations of prctl and allow them, > and fail most everything else, so doing it pre-rewrite seems OK. > > I'm not sure if it makes sense to catch it before or after the > secure computing check, though. On reflection maybe doing it > afterwards makes more sense - what do you think? I don't have a strong preference (I really hate all these hooks we have on the syscall entry/exit path), but we do need to make sure that the behaviour is consistent across architectures. Will ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-08-26 10:36 ` Will Deacon @ 2015-08-28 15:31 ` Chris Metcalf 2015-08-28 15:31 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-28 15:31 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION so we can achieve both no overhead for !TASK_ISOLATION and low latency (test TIF_NOHZ first) for TASK_ISOLATION. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- This "v6.1" is just a tweak to the existing v6 series to reflect Will Deacon's suggestions about the arm64 syscall entry code. I've updated the git tree with this updated patch in the series. 
A more disruptive change would be to capture the thread flags up front like x86 and tile, which allows the test itself to be optimized away if the task_isolation call becomes a no-op. arch/arm64/kernel/ptrace.c | 6 ++++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 81 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..5d4284445f70 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; +#ifdef CONFIG_TASK_ISOLATION + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); +#endif + if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
* [PATCH v6.1 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-08-28 15:31 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-28 15:31 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. For arm64 we use an explicit #ifdef CONFIG_TASK_ISOLATION so we can achieve both no overhead for !TASK_ISOLATION and low latency (testing TIF_NOHZ first) for TASK_ISOLATION. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly, separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- This "v6.1" is just a tweak to the existing v6 series to reflect Will Deacon's suggestions about the arm64 syscall entry code. I've updated the git tree with this updated patch in the series. A more disruptive change would be to capture the thread flags up front like x86 and tile, which allows the test itself to be optimized away if the task_isolation call becomes a no-op.
arch/arm64/kernel/ptrace.c | 6 ++++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 2 ++ include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 38 ++++++++++++++++++++++++++++++++++++++ 8 files changed, 81 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..5d4284445f70 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; +#ifdef CONFIG_TASK_ISOLATION + if (test_thread_flag(TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); +#endif + if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..2f9ce9466daf 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1479,6 +1479,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) if (work & _TIF_NOHZ) { user_exit(); work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); } #ifdef CONFIG_SECCOMP diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void 
task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..a89a6e9adfb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,40 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 4/6] task_isolation: provide strict mode configurable signal @ 2015-08-25 19:55 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a task_isolation process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/isolation.c | 17 +++++++++++++---- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index e16e13911e8a..2a4ddc890e22 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -195,5 +195,7 @@ struct prctl_mm_map { #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) # define PR_TASK_ISOLATION_STRICT (1 << 1) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index a89a6e9adfb4..b776aa632c8f 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -75,11 +75,20 @@ void task_isolation_enter(void) } } -static void kill_task_isolation_strict_task(void) +static void kill_task_isolation_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + dump_stack(); current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = 
PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); + if (sig == 0) + sig = SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -98,7 +107,7 @@ void task_isolation_syscall(int syscall) pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(1); } /* @@ -109,5 +118,5 @@ void task_isolation_exception(void) { pr_warn("%s/%d: task_isolation strict mode violated by exception\n", current->comm, current->pid); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(0); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v6 4/6] task_isolation: provide strict mode configurable signal 2015-08-25 19:55 ` Chris Metcalf (?) @ 2015-08-28 19:22 ` Andy Lutomirski 2015-09-02 18:38 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-08-28 19:22 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Aug 25, 2015 at 12:55 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > > In addition to being able to set the signal, we now also > pass whether or not the interruption was from a syscall in > the si_code field of the siginfo. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > include/uapi/linux/prctl.h | 2 ++ > kernel/isolation.c | 17 +++++++++++++---- > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index e16e13911e8a..2a4ddc890e22 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -195,5 +195,7 @@ struct prctl_mm_map { > #define PR_GET_TASK_ISOLATION 48 > # define PR_TASK_ISOLATION_ENABLE (1 << 0) > # define PR_TASK_ISOLATION_STRICT (1 << 1) > +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) > +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) > > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/isolation.c b/kernel/isolation.c > index a89a6e9adfb4..b776aa632c8f 100644 > --- a/kernel/isolation.c > +++ b/kernel/isolation.c > @@ -75,11 +75,20 @@ void task_isolation_enter(void) > } > } > > -static void kill_task_isolation_strict_task(void) > +static void 
kill_task_isolation_strict_task(int is_syscall) > { > + siginfo_t info = {}; > + int sig; > + > dump_stack(); > current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; > - send_sig(SIGKILL, current, 1); > + > + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); > + if (sig == 0) > + sig = SIGKILL; > + info.si_signo = sig; > + info.si_code = is_syscall; > + send_sig_info(sig, &info, current); The stuff you're doing here is sufficiently nasty that I think you should add something like: rcu_lockdep_assert(rcu_is_watching(), "some message here"); Because as it stands this is just asking for trouble. For the record, I am *extremely* unhappy with the state of the context tracking hooks. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-09-02 18:38 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-02 18:38 UTC (permalink / raw) To: Will Deacon, Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This change updates just one patch of the patch series, so rather than spamming out the whole series again, I've just updated this patch: - Will Deacon suggested using IS_ENABLED(CONFIG_TASK_ISOLATION) and also recommended having the same ordering between SECCOMP and TASK_ISOLATION on all platforms, an excellent suggestion. - Andy Lutomirski suggested using rcu_lockdep_assert(rcu_is_watching()) to ensure RCU was properly turned back on during our syscall test-and-kill for strict mode. I will update a full PATCH v7 once there seem to be no further comments on the rest of the v6 series. -- From: Chris Metcalf <cmetcalf@ezchip.com> Date: Tue, 28 Jul 2015 13:25:46 -0400 Subject: [PATCH v6.2 3/6] task_isolation: support PR_TASK_ISOLATION_STRICT mode With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. This change adds the syscall-detection hooks only for x86, arm64, and tile. 
We specify that it happens immediately after the SECCOMP test, which appropriately should be tested first. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/ptrace.c | 6 ++++++ arch/tile/kernel/ptrace.c | 5 ++++- arch/x86/kernel/ptrace.c | 10 +++++++++- include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 41 ++++++++++++++++++++++++++++++++++++++++ 8 files changed, 91 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index d882b833dbdb..737f62db8a6f 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1154,6 +1155,11 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && + test_thread_flag(TIF_NOHZ) && + task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index f84eed8243da..c327cb918a44 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -259,8 +259,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + } if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 9be72bc3613f..821699513a94 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1478,7 +1478,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) */ if (work & _TIF_NOHZ) { user_exit(); - work &= ~_TIF_NOHZ; + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) + work &= ~_TIF_NOHZ; } #ifdef CONFIG_SECCOMP @@ -1527,6 +1528,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) } #endif + /* Now check task isolation, if needed. */ + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { + work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); + } + /* Do our best to finish without phase 2. */ if (work == 0) return ret; /* seccomp and/or nohz only (ret == 0 here) */ diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index b96bd299966f..e0ac0228fea1 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != 
CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 79da784fe17a..e16e13911e8a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -194,5 +194,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 47 #define PR_GET_TASK_ISOLATION 48 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index c57c99f5c4d7..17a71f7b66b8 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -147,15 +147,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..caa40583fe0b 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,43 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + /* RCU should have been enabled prior to checking the syscall. */ + rcu_lockdep_assert(rcu_is_watching(), "syscall entry without RCU"); + + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. 
*/ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -169,6 +170,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -178,6 +180,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index d4618cd9e23d..caa40583fe0b 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -73,3 +74,43 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + /* RCU should have been enabled prior to checking the syscall. */ + rcu_lockdep_assert(rcu_is_watching(), "syscall entry without RCU"); + + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. 
*/ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v6 5/6] task_isolation: add debug boot flag 2015-08-25 19:55 ` Chris Metcalf ` (4 preceding siblings ...) (?) @ 2015-08-25 19:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel Cc: Chris Metcalf The new "task_isolation_debug" flag simplifies debugging of TASK_ISOLATION kernels when processes are running in PR_TASK_ISOLATION_ENABLE mode. Such processes should get no interrupts from the kernel; if they do, and this boot flag is specified, a kernel stack dump is generated on the console. It's possible to use ftrace to simply detect whether a task_isolation core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a task_isolation core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 7 +++++++ arch/tile/mm/homecache.c | 5 ++++- include/linux/isolation.h | 2 ++ kernel/irq_work.c | 5 ++++- kernel/sched/core.c | 21 +++++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 7 +++++++ 8 files changed, 54 insertions(+), 2 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1d6f0459cd7b..934f172eb140 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3595,6 +3595,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation_debug [KNL] + In kernels built with CONFIG_TASK_ISOLATION and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_TASK_ISOLATION_ENABLE + and is running on a nohz_full core. + tcpmhash_entries= [KNL,NET] Set the number of tcp_metrics_hash slots. Default value is 8192 or 16384 depending on total diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..a79325113105 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/isolation.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + task_isolation_debug(cpu); + } } /* diff --git a/include/linux/isolation.h b/include/linux/isolation.h index 27a4469831c1..9f1747331a36 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -18,11 +18,13 @@ extern void task_isolation_enter(void); extern void task_isolation_syscall(int nr); extern void task_isolation_exception(void); extern void task_isolation_wait(void); +extern void task_isolation_debug(int cpu); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } static inline void task_isolation_syscall(int nr) { } static inline void task_isolation_exception(void) { } +static inline void task_isolation_debug(int cpu) { } #endif static inline bool task_isolation_strict(void) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..745c2ea6a4e4 100644 --- a/kernel/irq_work.c +++ 
b/kernel/irq_work.c @@ -17,6 +17,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/smp.h> +#include <linux/isolation.h> #include <asm/processor.h> @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + task_isolation_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 78b4bad10081..0c4e4eba69b1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -74,6 +74,7 @@ #include <linux/binfmts.h> #include <linux/context_tracking.h> #include <linux/compiler.h> +#include <linux/isolation.h> #include <asm/switch_to.h> #include <asm/tlb.h> @@ -745,6 +746,26 @@ bool sched_can_stop_tick(void) } #endif /* CONFIG_NO_HZ_FULL */ +#ifdef CONFIG_TASK_ISOLATION +/* Enable debugging of any interrupts of task_isolation cores. */ +static int task_isolation_debug_flag; +static int __init task_isolation_debug_func(char *str) +{ + task_isolation_debug_flag = true; + return 1; +} +__setup("task_isolation_debug", task_isolation_debug_func); + +void task_isolation_debug(int cpu) +{ + if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) { + pr_err("Interrupt detected for task_isolation cpu %d\n", cpu); + dump_stack(); + } +} +#endif + void sched_avg_update(struct rq *rq) { s64 period = sched_avg_period(); diff --git a/kernel/signal.c b/kernel/signal.c index 836df8dac6cc..60e15e835b9e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_TASK_ISOLATION + /* If the task is being killed, don't complain about task_isolation. 
*/ + if (state & TASK_WAKEKILL) + t->task_isolation_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..b0bddff2693d 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/isolation.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ + task_isolation_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + task_isolation_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..ed762fec7265 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,8 +24,10 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -335,6 +337,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + task_isolation_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
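To tie the flag to its prerequisites, a typical boot-time invocation combines it with nohz_full=; the CPU list below is purely illustrative:

```
# Illustrative kernel boot arguments: run CPUs 1-3 tickless, and dump a
# console backtrace whenever the kernel is about to interrupt a task
# that has requested PR_TASK_ISOLATION_ENABLE on one of those CPUs.
nohz_full=1-3 task_isolation_debug
```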
* [PATCH v6 6/6] nohz: task_isolation: allow tick to be fully disabled 2015-08-25 19:55 ` Chris Metcalf ` (5 preceding siblings ...) (?) @ 2015-08-25 19:55 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-08-25 19:55 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on running completely tickless, so don't bound the time_delta for such processes. In addition, due to the way such processes quiesce by waiting for the timer tick to stop prior to returning to userspace, without this commit it won't be possible to use the task_isolation mode at all. Removing the 1-second cap was previously discussed (see link below) and Thomas Gleixner observed that vruntime, load balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be indefinitely deferred just meant that no one would ever fix the underlying bugs. However it's at least true that the mode proposed in this patch can only be enabled on a nohz_full core by a process requesting task_isolation mode, which may limit how important it is to maintain scheduler data correctly, for example. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, this will provide an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently, so it may also be worth it from that perspective.
Finally, it's worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2008) and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc). So this semantics is very useful if we can convince ourselves that doing this is safe. Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/time/tick-sched.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c792429e98c6..be296499b753 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/irq_regs.h> @@ -652,7 +653,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, #ifdef CONFIG_NO_HZ_FULL /* Limit the tick delta to the maximum scheduler deferment */ - if (!ts->inidle) + if (!ts->inidle && !task_isolation_enabled()) delta = min(delta, scheduler_tick_max_deferment()); #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v7 00/11] support "task_isolated" mode for nohz_full 2015-08-25 19:55 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The cover email for the patch series is getting a little unwieldy so I will provide a terser summary here, and just update the list of changes from version to version. Please see the previous versions linked by the In-Reply-To for more detailed comments about changes in earlier versions of the patch series. v7: The main change in this version is a change in where we call task_isolation_enter(). The arm64 code only invokes the context_tracking code right at kernel entry, and right at kernel exit, and the exit point is too late for task isolation; one of my test cases, when run on arm64, showed that a signal delivered while task isolation is waiting for the timer interrupt to quiesce was not properly handled before returning to userspace. The tilegx code properly handled that case because it ran user_exit() in the work-pending loop. But since arm64 calls user_exit() later, it was too late to go back and handle the signal. I decided to make the task isolation work explicit in the "work" loop done on return to userspace, and although I could have done this by hacking up the arm64 assembly code for this purpose, I decided to follow the x86 approach and use the prepare_exit_to_usermode() model where architectures handle work looping in C code. I added that support to arm64 and tile as a pre-requisite change, then modified the loop in C to call task isolation appropriately.
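In rough pseudocode (names modeled on the x86 entry code, not the literal patch), the resulting return-to-user work loop looks something like:

```
/* Sketch only: simplified prepare_exit_to_usermode() work loop. */
while (work_pending() || task_isolation_enabled()) {
        if (need_resched())
                schedule();
        if (signal_pending())
                do_signal(regs);             /* handled before final exit */
        if (task_isolation_enabled())
                task_isolation_enter();      /* quiesce: wait for tick to stop */
}
user_enter();                                /* context tracking; return to user */
```

Because signals are processed inside the same loop that performs the task-isolation quiesce, a signal arriving mid-quiesce is handled before the final return to userspace, which is the arm64 failure mode described above.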
This both makes the slowpath return-to-user code more maintainable for arm64 and tile going forward, and also avoids some of the subtlety where the context tracking code was being asked to invoke task isolation at user_enter() time. As a result of this change, I have moved all the architecture-specific changes to individual patches: two patches to switch arm64 and tile to the prepare_exit_to_usermode() loop, and three patches (one each for x86, arm64, and tile) to add the necessary call to task_isolation(), plus changes to check at syscall entry for strict mode. In addition, since arm64 doesn't use exception_enter(), I added an explicit call to task_isolation_exception() in do_mem_abort() so that page faults would be properly flagged in strict mode. I also added an RCU_LOCKDEP_WARN() at Andy Lutomirski's suggestion. And, the patch series is rebased to v4.3-rc1. v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic) v5: rebased on kernel v4.2-rc3 converted to use CONFIG_CPU_ISOLATED and separate .c and .h files incorporates Christoph Lameter's quiet_vmstat() call v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. 
However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested this stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts (for example, workqueues on task_isolation cores still remain to be dealt with), this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. 
The series (based currently on v4.3-rc1) is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane Note: I have not removed the commit to disable the 1Hz timer tick fallback that was nack'ed by PeterZ, pending a decision on that thread as to what to do (https://lkml.org/lkml/2015/5/8/555); also, if we were to drop that commit and keep the 1Hz tick, task_isolation threads would never re-enter userspace, since a tick would always be pending. Chris Metcalf (10): task_isolation: add initial support task_isolation: support PR_TASK_ISOLATION_STRICT mode task_isolation: provide strict mode configurable signal task_isolation: add debug boot flag nohz: task_isolation: allow tick to be fully disabled arch/x86: enable task isolation functionality arch/arm64: adopt prepare_exit_to_usermode() model from x86 arch/arm64: enable task isolation functionality arch/tile: adopt prepare_exit_to_usermode() model from x86 arch/tile: enable task isolation functionality Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 ++ arch/arm64/include/asm/thread_info.h | 18 +++-- arch/arm64/kernel/entry.S | 6 +- arch/arm64/kernel/ptrace.c | 10 ++- arch/arm64/kernel/signal.c | 36 +++++++--- arch/arm64/mm/fault.c | 8 +++ arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/thread_info.h | 8 ++- arch/tile/kernel/intvec_32.S | 46 ++++--------- arch/tile/kernel/intvec_64.S | 49 +++++--------- arch/tile/kernel/process.c | 92 ++++++++++++++----------- arch/tile/kernel/ptrace.c | 3 + arch/tile/mm/homecache.c | 5 +- arch/x86/entry/common.c | 45 ++++++++++--- include/linux/context_tracking.h | 11 ++- include/linux/isolation.h | 42 ++++++++++++ include/linux/sched.h | 3 + include/linux/vmstat.h | 2 + include/uapi/linux/prctl.h | 8 +++ init/Kconfig | 20 ++++++ kernel/Makefile | 1 + kernel/context_tracking.c | 9 ++- kernel/irq_work.c | 5 +- kernel/isolation.c | 127 +++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 21 ++++++ kernel/signal.c | 5 ++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 ++ kernel/sys.c | 8 +++ kernel/time/tick-sched.c | 3 +- mm/vmstat.c | 14 ++++ 31 files changed, 477 insertions(+), 148 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 01/11] vmstat: provide a function to quiet down the diff processing 2015-09-28 15:17 ` Chris Metcalf (?) @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel From: Christoph Lameter <cl@linux.com> quiet_vmstat() can be called in anticipation of an OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing. Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes. Signed-off-by: Christoph Lameter <cl@linux.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 14 ++++++++++++++ 2 files changed, 16 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 82e7db7f7100..c013b8d8e434 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page, static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } +static inline void quiet_vmstat(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 
4f5cd974e11a..cf7d324f16e2 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1394,6 +1394,20 @@ static void vmstat_update(struct work_struct *w) } /* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ +void quiet_vmstat(void) +{ + do { + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); + + } while (refresh_cpu_vm_stats()); +} + +/* * Check if the diffs for a certain cpu indicate that * an update is needed. */ -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
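The contract of quiet_vmstat() above — keep folding pending per-cpu differentials until nothing remains — can be modeled outside the kernel. The following is only an illustrative userspace sketch with mock counters (refresh_mock_stats() standing in for refresh_cpu_vm_stats(), quiet_mock_stats() for quiet_vmstat()); it shows the fold-until-quiet loop, not the kernel's per-cpu machinery:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-cpu vmstat differentials. */
static int diffs[3] = { 5, 0, 2 };
static int global_counter;

/* Models refresh_cpu_vm_stats(): fold any pending diffs into the
 * global counter and report whether anything was folded. */
static bool refresh_mock_stats(void)
{
	bool changed = false;
	int i;

	for (i = 0; i < 3; i++) {
		if (diffs[i]) {
			global_counter += diffs[i];
			diffs[i] = 0;
			changed = true;
		}
	}
	return changed;
}

/* Models quiet_vmstat(): keep folding until the diffs stay at zero.
 * In the kernel, extra iterations matter because concurrent updates
 * can repopulate the diffs between passes. */
static void quiet_mock_stats(void)
{
	while (refresh_mock_stats())
		;
}
```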
* [PATCH v7 02/11] task_isolation: add initial support 2015-09-28 15:17 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flags, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_enter() routine to take additional actions to help the task avoid being interrupted in the future. Initially, there are only three actions taken. First, the task calls lru_add_drain() to prevent being interrupted by a subsequent lru_add_drain_all() call on another core. Then, it calls quiet_vmstat() to quieten the vmstat worker to avoid a follow-on interrupt. 
Finally, the code checks for pending timer interrupts and quiesces until they are no longer pending. As a result, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. The task_isolation_enter() routine must be called just before the hard return to userspace, so it is appropriately placed in the prepare_exit_to_usermode() routine for an individual architecture or some comparable location. Separate patches that follow provide these changes for x86, arm64, and tile. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 24 +++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 +++ init/Kconfig | 20 ++++++++++++ kernel/Makefile | 1 + kernel/isolation.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 8 +++++ 7 files changed, 138 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..fd04011b1c1e --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,24 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); +} + +extern void task_isolation_enter(void); +extern void task_isolation_wait(void); +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline void task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index a4ab9daa387c..bd2dc26948a6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1800,6 +1800,9 @@ struct 
task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..67224df4b559 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,9 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 48 +#define PR_GET_TASK_ISOLATION 49 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c24b6f767bf0..4ff7f052059a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a good-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. 
+ config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 53abf008ecb3..693a2ba35679 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..6ace866c69f6 --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,77 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include "time/tick-sched.h" + +/* + * Rather than continuously polling for the next_event in the + * tick_cpu_device, architectures can provide a method to save power + * by sleeping until an interrupt arrives. + * + * Note that it must be guaranteed for a particular architecture + * that if next_event is not KTIME_MAX, then a timer interrupt will + * occur, otherwise the sleep may never awaken. + */ +void __weak task_isolation_wait(void) +{ + cpu_relax(); +} + +/* + * We normally return immediately to userspace. + * + * In task_isolation mode we wait until no more interrupts are + * pending. Otherwise we nap with interrupts enabled and wait for the + * next interrupt to fire, then loop back and retry. + * + * Note that if you schedule two task_isolation processes on the same + * core, neither will ever leave the kernel, and one will have to be + * killed manually. Otherwise in situations where another process is + * in the runqueue on this cpu, this task will just wait for that + * other task to go idle before returning to user space. 
+ */ +void task_isolation_enter(void) +{ + struct clock_event_device *dev = + __this_cpu_read(tick_cpu_device.evtdev); + struct task_struct *task = current; + unsigned long start = jiffies; + bool warned = false; + + if (WARN_ON(irqs_disabled())) + local_irq_enable(); + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); + + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { + if (!warned && (jiffies - start) >= (5 * HZ)) { + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + warned = true; + } + cond_resched(); + if (test_thread_flag(TIF_SIGPENDING)) + break; + task_isolation_wait(); + } + if (warned) { + pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n", + task->comm, task->pid, smp_processor_id(), + (jiffies - start) / HZ); + dump_stack(); + } +} diff --git a/kernel/sys.c b/kernel/sys.c index fa2f2f671a5c..a2c6eb1d4ad9 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2266,6 +2266,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + me->task_isolation_flags = arg2; + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
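The control flow of task_isolation_enter() above — a backstop loop that refuses to return to userspace while the clock event device still has an event scheduled — can be modeled in plain userspace C. Everything below is a mock (mock_next_event standing in for dev->next_event, mock_wait() for task_isolation_wait()); it illustrates only the loop shape, not the real timer machinery or the 5-second warning path:

```c
#include <limits.h>

/* Mock analogue of KTIME_MAX: "no timer event scheduled". */
#define MOCK_KTIME_MAX LLONG_MAX

/* Pretend a timer event is initially pending. */
static long long mock_next_event = 1000;
static int wait_calls;

/* Stand-in for task_isolation_wait(): the real version naps until
 * the next interrupt; here the "timer" quiesces after three waits. */
static void mock_wait(void)
{
	if (++wait_calls >= 3)
		mock_next_event = MOCK_KTIME_MAX;
}

/* Models the backstop: do not "return to userspace" while a timer
 * event is still scheduled; returns how many times we waited. */
static int mock_isolation_enter(void)
{
	int iterations = 0;

	while (mock_next_event != MOCK_KTIME_MAX) {
		mock_wait();
		iterations++;
	}
	return iterations;
}
```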
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-09-28 15:17 ` Chris Metcalf (?) @ 2015-10-01 12:14 ` Frederic Weisbecker 2015-10-01 12:18 ` Thomas Gleixner 2015-10-01 19:25 ` Chris Metcalf -1 siblings, 2 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-01 12:14 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > diff --git a/include/linux/isolation.h b/include/linux/isolation.h > new file mode 100644 > index 000000000000..fd04011b1c1e > --- /dev/null > +++ b/include/linux/isolation.h > @@ -0,0 +1,24 @@ > +/* > + * Task isolation related global functions > + */ > +#ifndef _LINUX_ISOLATION_H > +#define _LINUX_ISOLATION_H > + > +#include <linux/tick.h> > +#include <linux/prctl.h> > + > +#ifdef CONFIG_TASK_ISOLATION > +static inline bool task_isolation_enabled(void) > +{ > + return tick_nohz_full_cpu(smp_processor_id()) && > + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); Ok, I may be belaboring this a bit, but how about using the regular existing task flags, and if needed later we can still introduce a new field in struct task_struct? > diff --git a/kernel/isolation.c b/kernel/isolation.c > new file mode 100644 > index 000000000000..6ace866c69f6 > --- /dev/null > +++ b/kernel/isolation.c > @@ -0,0 +1,77 @@ > +/* > + * linux/kernel/isolation.c > + * > + * Implementation for task isolation. > + * > + * Distributed under GPLv2. 
> + */ > + > +#include <linux/mm.h> > +#include <linux/swap.h> > +#include <linux/vmstat.h> > +#include <linux/isolation.h> > +#include "time/tick-sched.h" > + > +/* > + * Rather than continuously polling for the next_event in the > + * tick_cpu_device, architectures can provide a method to save power > + * by sleeping until an interrupt arrives. > + * > + * Note that it must be guaranteed for a particular architecture > + * that if next_event is not KTIME_MAX, then a timer interrupt will > + * occur, otherwise the sleep may never awaken. > + */ > +void __weak task_isolation_wait(void) > +{ > + cpu_relax(); > +} > + > +/* > + * We normally return immediately to userspace. > + * > + * In task_isolation mode we wait until no more interrupts are > + * pending. Otherwise we nap with interrupts enabled and wait for the > + * next interrupt to fire, then loop back and retry. > + * > + * Note that if you schedule two task_isolation processes on the same > + * core, neither will ever leave the kernel, and one will have to be > + * killed manually. Otherwise in situations where another process is > + * in the runqueue on this cpu, this task will just wait for that > + * other task to go idle before returning to user space. > + */ > +void task_isolation_enter(void) > +{ > + struct clock_event_device *dev = > + __this_cpu_read(tick_cpu_device.evtdev); > + struct task_struct *task = current; > + unsigned long start = jiffies; > + bool warned = false; > + > + if (WARN_ON(irqs_disabled())) > + local_irq_enable(); > + > + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ > + lru_add_drain(); > + > + /* Quieten the vmstat worker so it won't interrupt us. */ > + quiet_vmstat(); > + > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { You should add a function in tick-sched.c to get the next tick. This is supposed to be a private field. 
> + if (!warned && (jiffies - start) >= (5 * HZ)) { > + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", > + task->comm, task->pid, smp_processor_id(), > + (jiffies - start) / HZ); > + warned = true; > + } > + cond_resched(); > + if (test_thread_flag(TIF_SIGPENDING)) > + break; Why not use signal_pending()? > + task_isolation_wait(); I still think we could try a standard wait-wake scheme. Thanks. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:14 ` Frederic Weisbecker @ 2015-10-01 12:18 ` Thomas Gleixner 2015-10-01 12:23 ` Frederic Weisbecker 2015-10-01 17:02 ` Chris Metcalf 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 2 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 12:18 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > + > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > You should add a function in tick-sched.c to get the next tick. This > is supposed to be a private field. Just to make it clear. Neither the above nor a similar check in tick-sched.c is going to happen. This busy waiting is just horrible. Get your act together and solve the problems at the root and do not inflict your quick and dirty 'solutions' on us. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:18 ` Thomas Gleixner @ 2015-10-01 12:23 ` Frederic Weisbecker 2015-10-01 12:31 ` Thomas Gleixner 2015-10-01 17:02 ` Chris Metcalf 1 sibling, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-01 12:23 UTC (permalink / raw) To: Thomas Gleixner Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > > + > > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > > > You should add a function in tick-sched.c to get the next tick. This > > is supposed to be a private field. > > Just to make it clear. Neither the above nor a similar check in > tick-sched.c is going to happen. > > This busy waiting is just horrible. Get your act together and solve > the problems at the root and do not inflict your quick and dirty > 'solutions' on us. That's why I proposed a wait-wake scheme instead with the tick stop code. What's your opinion about such direction? ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:23 ` Frederic Weisbecker @ 2015-10-01 12:31 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 12:31 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > On Thu, Oct 01, 2015 at 02:18:42PM +0200, Thomas Gleixner wrote: > > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > > > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: > > > > + > > > > + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > > > > > > You should add a function in tick-sched.c to get the next tick. This > > > is supposed to be a private field. > > > > Just to make it clear. Neither the above nor a similar check in > > tick-sched.c is going to happen. > > > > This busy waiting is just horrible. Get your act together and solve > > the problems at the root and do not inflict your quick and dirty > > 'solutions' on us. > > That's why I proposed a wait-wake scheme instead with the tick stop > code. What's your opinion about such direction? Definitely more sensible than mindlessly busy looping. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:18 ` Thomas Gleixner @ 2015-10-01 17:02 ` Chris Metcalf 2015-10-01 17:02 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-01 17:02 UTC (permalink / raw) To: Thomas Gleixner, Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 08:18 AM, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Frederic Weisbecker wrote: >> On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: >>> + >>> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { >> You should add a function in tick-sched.c to get the next tick. This >> is supposed to be a private field. > Just to make it clear. Neither the above nor a similar check in > tick-sched.c is going to happen. > > This busy waiting is just horrible. Get your act together and solve > the problems at the root and do not inflict your quick and dirty > 'solutions' on us. Thomas, You've raised a couple of different concerns and I want to make sure I try to address them individually. But first I want to address the question of the basic semantics of the patch series. I wrote up a description of why it's useful in my email yesterday: https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com I haven't directly heard from you as to whether you buy the basic premise of "hard isolation" in terms of protecting tasks from all kernel interrupts while they execute in userspace. I will add here that we've heard from multiple customers that the equivalent Tilera functionality (Zero-Overhead Linux) was the thing that brought them to buy our hardware rather than a competitor's. 
It's allowed them to write code that runs under a full-featured Linux environment rather than doing the thing that they otherwise would have been required to do, which is to target a minimal bare-metal environment. So as a feature, if we can gain consensus on an implementation of it, I think it will be an important step for that class of users, and potential users, of Linux. So I first want to address what is effectively the API concern that you raised, namely that you're concerned that there is a wait loop in the implementation. The nice thing here is that there is in fact no requirement in the API/ABI that we have a wait loop in the kernel at all. Let's say hypothetically that in the future we come up with a way to guarantee, perhaps in some constrained kind of way, that you can enter and exit the kernel and are guaranteed no further timer interrupts, and we are so confident of this property that we don't have to test for it programmatically on kernel exit. (In fact, we would likely still use the task_isolation_debug boot flag to generate a console warning if it ever did happen, but whatever.) At this point we could simply remove the timer interrupt test loop in task_isolation_wait(); the applications would be none the wiser, and the kernel would be that much cleaner. However, today, and I think for the future, I see that loop as an important backstop for whatever timer-elimination coding happens. In general, the hard task-isolation requirement is something that is of particular interest only to a subset of the kernel community. As the kernel grows, adds features, re-implements functionality, etc., it seems entirely likely that odd bits of deferred functionality might be added in the same way that RCU, workqueues, etc., have done in the past. Or, applications might exercise unusual corners of the kernel's semantics and come across an existing mechanism that ends up enabling kernel ticks (maybe only one or two) before returning to userspace. 
The proposed busy-loop just prevents that from damaging the application. I'm skeptical that we can prevent all such possible changes today and in the future, and I think the loop is a simple way of arranging to avoid breaking applications with interrupts, that only triggers for applications that have requested it, on cores that have been configured to support it. One additional insight that argues in favor of a busy-waiting solution is that a task that requests task isolation is almost certainly alone on the core. If multiple tasks are in fact runnable on that core, we have already abandoned the ability to use proper task isolation since we will want to use timer ticks to run the scheduler for pre-emption. So we only busy wait when, in fact, no other useful work is likely to get done on that core anyway. The other questions you raise have to do with the mechanism for ensuring that we wait until no timer interrupts are scheduled. First is the question of how we detect that case. As I said yesterday, the original approach I chose for the Tilera implementation was one where we simply wait until the timer interrupt is masked (as is done via the set_state_shutdown, set_state_oneshot, and tick_resume callbacks in the tile clock_event_device). When unmasked, the timer down-counter just counts down to zero, fires the interrupt, resets to its start value, and counts down again until it fires again. So we use masking of the interrupt to turn off the timer tick. Once we have done so, we are guaranteed no further timer interrupts can occur. I'm less familiar with the timer subsystems of other architectures, but there are clearly per-platform ways to make the same kinds of checks. If this seems like a better approach, I'm happy to work to add the necessary checks on tile, arm64, and x86, though I'd certainly benefit from some guidance on the timer implementation on the latter two platforms. 
One reason this might be necessary is if there is support on some platforms for multiple timer interrupts, any of which can fire, not just a single timer driven by the clock_event_device. I'm not sure whether this is ever in fact a problem, but if it is, it would almost certainly require per-architecture code to determine whether all the relevant timers were quiesced. However, I'm not sure whether you don't like the fact of checking the next_event in tick_cpu_device per se, or if it's the busy-waiting we do when it indicates a pending timer that bothers you. If you could help clarify this piece, that would be good. The last question is what to do when we detect that there is a timer interrupt scheduled. The current code spins, testing for resched or signal events, and bails out back to the work-pending loop when that happens. As an extension, one can add support for spinning in a lower-power state, as I did for tile, but this isn't required and frankly isn't that important, since we don't anticipate spending much time in the busy-loop state anyway. The suggestion proposed by Frederic and echoed by you is a wake-wait scheme. I'm curious to hear a more fully fleshed-out suggestion. Clearly, we can test for pending timer interrupts and put the task to sleep (pretty late in the return-to-userspace process, but maybe that's OK). The question is, how and when do we wake the task? We could add a hook to the platform timer shutdown code that would also wake any process that was waiting for the no-timer case; that process would then end up getting scheduled sometime later, and hopefully when it came time for it to try exiting to userspace again, the timer would still be shut down. This could be problematic if the scheduler code or some other part of the kernel sets up the timer again before scheduling the waiting task back in. Arguably we can work to avoid this if it's really a problem.
And, there is the question of how to handle multiple timer interrupt sources, since they would all have to quiesce before we would want to wake the waiting process, but the "multiple timers" case isn't handled by the current code either, and it seems not to be a problem, so perhaps that's OK. Lastly, of course, is the question of what the kernel would end up doing while waiting: and the answer is almost certainly that it would sit in the cpu idle loop, waiting for the pending timer to fire and wake the waiting task. I'm not convinced that the extra complexity here is worth the gain. But I am open and willing to be convinced that I am wrong, and to implement different approaches. Let me know! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com
* Re: [PATCH v7 02/11] task_isolation: add initial support @ 2015-10-01 21:20 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 21:20 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, 1 Oct 2015, Chris Metcalf wrote: > But first I want to address the question of the basic semantics > of the patch series. I wrote up a description of why it's useful > in my email yesterday: > > https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com > > I haven't directly heard from you as to whether you buy the > basic premise of "hard isolation" in terms of protecting tasks > from all kernel interrupts while they execute in userspace. Just for the record. The first serious initiative to solve that problem started here in my own company when I guided Frederic through the endeavour of figuring out what needs to be done to achieve that. That was the assignment of his master thesis, which I gave him. So I'm very well aware why this is needed and what needs to be done. I started this, because I got tired of half-baked attempts to solve the problem, which were even worse than what you are trying to do now. > So I first want to address what is effectively the API concern that > you raised, namely that you're concerned that there is a wait > loop in the implementation. That wait loop is just a placeholder for the underlying more serious concern I have with this whole approach. And I raised that concern several times in the past and I'm happy to do so again. The people working on this, especially you, are just dead set to achieve a certain functionality by jamming half-baked mechanisms into the kernel and especially into the low level entry/exit code.
And that's something which really annoys me, simply because you refuse to tackle the problems which have been identified as needing to be solved 5+ years ago when Frederic did his thesis. Remote accounting: ================== It's not an easy problem, but it's not rocket science either. It's just quite some work. I know that you just don't give a shit about it because your use case does not care. But it's an essential part of the problem space. You just work around it by shutting down the tick completely and relying on the fact that it does not explode in your face today. If we accept your hackery, then who is going to fix it when it explodes in half a year from now? Tick shut down: =============== I still have to understand why the tick is needed at all. There is exactly one reason why the tick must run if a cpu is in full isolation mode: More than one SCHED_OTHER task is runnable on that cpu. There is no other reason, period. If there are requirements today to switch on the tick when a task running in full isolation mode enters the kernel, then they need to be fixed first. And again you don't care, because for your particular use case it's good enough to slap a busy wait loop into every arch's low level exit code and be done with it. From your mail excusing that approach: > The nice thing here is that there is in fact no requirement in > the API/ABI that we have a wait loop in the kernel at all. Let's > say hypothetically that in the future we come up with a way to > guarantee, perhaps in some constrained kind of way, that you > can enter and exit the kernel and are guaranteed no further > timer interrupts, .... "Let's say hypothetically" tells it all. You are not even trying to find a proper solution. You just try to get your particular interest solved. That's exactly the attitude which drives me nuts and that's the point where I say no. You can do all of that in an out-of-tree patch set as many other hard-to-solve features have done for years.
Yes, it's an annoying catchup game, but it forces you to think harder, refactor code and do a lot of extra work to finally get it merged. Thanks, tglx
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 21:20 ` Thomas Gleixner @ 2015-10-02 17:15 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-02 17:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 05:20 PM, Thomas Gleixner wrote: > On Thu, 1 Oct 2015, Chris Metcalf wrote: >> But first I want to address the question of the basic semantics >> of the patch series. I wrote up a description of why it's useful >> in my email yesterday: >> >> https://lkml.kernel.org/r/560C4CF4.9090601@ezchip.com >> >> I haven't directly heard from you as to whether you buy the >> basic premise of "hard isolation" in terms of protecting tasks >> from all kernel interrupts while they execute in userspace. > Just for the record. The first serious initiative to solve that > problem started here in my own company when I guided Frederic through > the endavour of figuring out what needs to be done to achieve > that. That was the assignement of his master thesis, which I gave him. Thanks for that background. I didn't know you had gotten Frederic started down that path originally. >> So I first want to address what is effectively the API concern that >> you raised, namely that you're concerned that there is a wait >> loop in the implementation. > That wait loop is just a place holder for the underlying more serious > concern I have with this whole approach. And I raised that concern > several times in the past and I'm happy to do so again. > > The people working on this, especially you, are just dead set to > achieve a certain functionality by jamming half baken mechanisms into > the kernel and especially into the low level entry/exit code. 
And > that's something which really annoys me, simply because you refuse to > tackle the problems which have been identified as need to be solved 5+ > years ago when Frederic did his thesis. I think you raise a good point. I still claim my arguments are plausible, but you may be right that this is an instance where forcing a different approach is better for the kernel community as a whole. Given that, what would you think of the following two changes to my proposed patch series: 1. Rather than spinning in a busy loop if timers are pending, we reschedule if more than one task is ready to run. This directly targets the "architected" problem with the scheduler tick, rather than sweeping up the scheduler tick and any other timers into the one catch-all of "any timer ready to fire". (We can use sched_can_stop_tick() to check the case where other tasks can preempt us.) This would then provide part of the semantics of the task-isolation flag. The other part is running whatever code can be run to avoid the various ways tasks might get interrupted later (lru_add_drain(), quiet_vmstat(), etc) that are not appropriate to run unconditionally for tasks that aren't trying to be isolated. 2. Remove the tie between disabling the 1 Hz max deferment and task isolation per se. Instead add a boot flag (e.g. "debug_1hz_tick") that lets us turn off the 1 Hz tick to make it easy to experiment with both the negative effects of the missing tick, as well as to try to learn in parallel what actual timer interrupts are firing "on purpose" rather than just due to the 1 Hz tick to try to eliminate them as well. For #1, I'm not sure if it's better to hack up the scheduler's pick_next_task callback methods to avoid task-isolation tasks when other tasks are also available to run, or just to observe that there are additional tasks ready to run during exit to userspace, and yield the cpu to allow those other tasks to run. 
The advantage of doing it at exit to userspace is that we can easily yield in a loop and pay attention to whether we seem not to be making forward progress with that task and generate a suitable warning; it also keeps a lot of task-isolation stuff out of the core scheduler code, which may be a plus. With these changes, and booting with the "debug_1hz_tick" flag, I'm seeing a couple of timer ticks hit my task-isolation task in the first 20 ms or so, and then it quiesces. I will plan to work on figuring out what is triggering those interrupts and seeing how to fix them. My hope is that in parallel with that work, other folks can be working on how to fix problems that occur more silently with the scheduler tick max deferment disabled; I'm also happy to work on those problems to the extent that I understand them (and I'm always happy to learn more). As part of the patch series I'd extend the proposed task_isolation_debug flag to also track timer scheduling events against task-isolation tasks that are ready to run in userspace (no other runnable tasks). What do you think of this approach? -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com
* Re: [PATCH v7 02/11] task_isolation: add initial support @ 2015-10-02 19:02 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-02 19:02 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Chris, On Fri, 2 Oct 2015, Chris Metcalf wrote: > 1. Rather than spinning in a busy loop if timers are pending, > we reschedule if more than one task is ready to run. This > directly targets the "architected" problem with the scheduler > tick, rather than sweeping up the scheduler tick and any other > timers into the one catch-all of "any timer ready to fire". > (We can use sched_can_stop_tick() to check the case where > other tasks can preempt us.) This would then provide part > of the semantics of the task-isolation flag. The other part is > running whatever code can be run to avoid the various ways > tasks might get interrupted later (lru_add_drain(), > quiet_vmstat(), etc) that are not appropriate to run > unconditionally for tasks that aren't trying to be isolated. Sounds like a plan > 2. Remove the tie between disabling the 1 Hz max deferment > and task isolation per se. Instead add a boot flag (e.g. > "debug_1hz_tick") that lets us turn off the 1 Hz tick to make it > easy to experiment with both the negative effects of the > missing tick, as well as to try to learn in parallel what actual > timer interrupts are firing "on purpose" rather than just due > to the 1 Hz tick to try to eliminate them as well. I have no problem with a debug flag, which allows you to experiment, though I'm not entirely sure whether we need to carry it in mainline or just in an extra isolation git tree. 
> For #1, I'm not sure if it's better to hack up the scheduler's > pick_next_task callback methods to avoid task-isolation tasks > when other tasks are also available to run, or just to observe > that there are additional tasks ready to run during exit to > userspace, and yield the cpu to allow those other tasks to run. > The advantage of doing it at exit to userspace is that we can > easily yield in a loop and pay attention to whether we seem > not to be making forward progress with that task and generate > a suitable warning; it also keeps a lot of task-isolation stuff > out of the core scheduler code, which may be a plus. You should discuss that with Peter Zijlstra. I see the plus not to have it in the scheduler, but OTOH having it in the core code has its advantages as well. Let's see how ugly it gets. > With these changes, and booting with the "debug_1hz_tick" > flag, I'm seeing a couple of timer ticks hit my task-isolation > task in the first 20 ms or so, and then it quiesces. I will > plan to work on figuring out what is triggering those > interrupts and seeing how to fix them. My hope is that in > parallel with that work, other folks can be working on how to > fix problems that occur more silently with the scheduler > tick max deferment disabled; I'm also happy to work on those > problems to the extent that I understand them (and I'm > always happy to learn more). I like that approach :) > As part of the patch series I'd extend the proposed > task_isolation_debug flag to also track timer scheduling > events against task-isolation tasks that are ready to run > in userspace (no other runnable tasks). > > What do you think of this approach? Makes sense. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 02/11] task_isolation: add initial support 2015-10-01 12:14 ` Frederic Weisbecker @ 2015-10-01 19:25 ` Chris Metcalf 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/01/2015 08:14 AM, Frederic Weisbecker wrote: > On Mon, Sep 28, 2015 at 11:17:17AM -0400, Chris Metcalf wrote: >> diff --git a/include/linux/isolation.h b/include/linux/isolation.h >> new file mode 100644 >> index 000000000000..fd04011b1c1e >> --- /dev/null >> +++ b/include/linux/isolation.h >> @@ -0,0 +1,24 @@ >> +/* >> + * Task isolation related global functions >> + */ >> +#ifndef _LINUX_ISOLATION_H >> +#define _LINUX_ISOLATION_H >> + >> +#include <linux/tick.h> >> +#include <linux/prctl.h> >> + >> +#ifdef CONFIG_TASK_ISOLATION >> +static inline bool task_isolation_enabled(void) >> +{ >> + return tick_nohz_full_cpu(smp_processor_id()) && >> + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); > Ok, I may be a bit burdening with that but, how about using the regular > existing task flags, and if needed later we can still introduce a new field > in struct task_struct? The problem is still that we have two basic bits ("enabled" and "strict") plus eight bits of signal number to override SIGKILL. So we end up with *something* extra in task_struct no matter what. And, right now it's conveniently the same value as the bits passed to prctl(), so we don't need to marshall and unmarshall the prctl() get/set results. 
If we could convince ourselves not to do the "settable signal" stuff I'd agree that using task flags makes sense, but I was convinced for v2 of the patch series to add a settable signal, and I suspect it still does make sense. >> + while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) { > You should add a function in tick-sched.c to get the next tick. This > is supposed to be a private field. Yes. Or probably better, a function that just says whether the timer is quiesced. Obviously I'll wait to hear what Thomas says on this subject first, though. >> + if (!warned && (jiffies - start) >= (5 * HZ)) { >> + pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n", >> + task->comm, task->pid, smp_processor_id(), >> + (jiffies - start) / HZ); >> + warned = true; >> + } >> + cond_resched(); >> + if (test_thread_flag(TIF_SIGPENDING)) >> + break; > Why not use signal_pending()? Makes sense, thanks. > I still think we could try a wait-wake standard scheme. I'm curious to hear what you make of my arguments in the other thread on this subject! -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-28 15:17 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal; this is defined as happening immediately after the SECCOMP test. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. The signature of context_tracking_exit() changes to report whether we, in fact, are exiting back to user space, so that we can track user exceptions properly separately from other kernel entries. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/context_tracking.h | 11 ++++++++--- include/linux/isolation.h | 16 ++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/context_tracking.c | 9 ++++++--- kernel/isolation.c | 41 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 72 insertions(+), 6 deletions(-) diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index 008fc67d0d96..a840374f5d29 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -3,6 +3,7 @@ #include <linux/sched.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <linux/context_tracking_state.h> #include <asm/ptrace.h> @@ -11,7 +12,7 @@ extern void context_tracking_cpu_set(int cpu); extern void context_tracking_enter(enum ctx_state state); -extern void context_tracking_exit(enum ctx_state state); +extern bool context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) return 0; prev_ctx = this_cpu_read(context_tracking.state); - if (prev_ctx != CONTEXT_KERNEL) - context_tracking_exit(prev_ctx); + if (prev_ctx != CONTEXT_KERNEL) { + if (context_tracking_exit(prev_ctx)) { + if (task_isolation_strict()) + task_isolation_exception(); + } + } return prev_ctx; } diff --git a/include/linux/isolation.h b/include/linux/isolation.h index fd04011b1c1e..27a4469831c1 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -15,10 +15,26 @@ static inline bool task_isolation_enabled(void) } extern void task_isolation_enter(void); +extern void task_isolation_syscall(int nr); +extern void task_isolation_exception(void); extern void task_isolation_wait(void); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } +static inline void task_isolation_syscall(int nr) { } +static inline 
void task_isolation_exception(void) { } #endif +static inline bool task_isolation_strict(void) +{ +#ifdef CONFIG_TASK_ISOLATION + if (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) + return true; +#endif + return false; +} + #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 67224df4b559..2b8038b0d1e1 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -201,5 +201,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 48 #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0a495ab35bc7..ffca3c3fe64a 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. 
*/ -void context_tracking_exit(enum ctx_state state) +bool context_tracking_exit(enum ctx_state state) { unsigned long flags; + bool from_user = false; if (!context_tracking_is_enabled()) - return; + return false; if (in_interrupt()) - return; + return false; local_irq_save(flags); if (!context_tracking_recursion_enter()) @@ -166,6 +167,7 @@ void context_tracking_exit(enum ctx_state state) */ rcu_user_exit(); if (state == CONTEXT_USER) { + from_user = true; vtime_user_exit(current); trace_user_exit(0); } @@ -175,6 +177,7 @@ void context_tracking_exit(enum ctx_state state) context_tracking_recursion_exit(); out_irq_restore: local_irq_restore(flags); + return from_user; } NOKPROBE_SYMBOL(context_tracking_exit); EXPORT_SYMBOL_GPL(context_tracking_exit); diff --git a/kernel/isolation.c b/kernel/isolation.c index 6ace866c69f6..3779ba670472 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -10,6 +10,7 @@ #include <linux/swap.h> #include <linux/vmstat.h> #include <linux/isolation.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -75,3 +76,43 @@ void task_isolation_enter(void) dump_stack(); } } + +static void kill_task_isolation_strict_task(void) +{ + /* RCU should have been enabled prior to this point. */ + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); + + dump_stack(); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +void task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return; + } + + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", + current->comm, current->pid, syscall); + kill_task_isolation_strict_task(); +} + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. 
+ */ +void task_isolation_exception(void) +{ + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", + current->comm, current->pid); + kill_task_isolation_strict_task(); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-09-28 20:51 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:51 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > With task_isolation mode, the task is in principle guaranteed not to > be interrupted by the kernel, but only if it behaves. In particular, > if it enters the kernel via system call, page fault, or any of a > number of other synchronous traps, it may be unexpectedly exposed > to long latencies. Add a simple flag that puts the process into > a state where any such kernel entry is fatal; this is defined as > happening immediately after the SECCOMP test. Why after seccomp? Seccomp is still an entry, and the code would be considerably simpler if it were before seccomp. > @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) > return 0; > > prev_ctx = this_cpu_read(context_tracking.state); > - if (prev_ctx != CONTEXT_KERNEL) > - context_tracking_exit(prev_ctx); > + if (prev_ctx != CONTEXT_KERNEL) { > + if (context_tracking_exit(prev_ctx)) { > + if (task_isolation_strict()) > + task_isolation_exception(); > + } > + } > > return prev_ctx; > } x86 does not promise to call this function. In fact, x86 is rather likely to stop ever calling this function in the reasonably near future. > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); > * This call supports re-entrancy. This way it can be called from any exception > * handler without needing to know if we came from userspace or not. 
> */ > -void context_tracking_exit(enum ctx_state state) > +bool context_tracking_exit(enum ctx_state state) This needs clear documentation of what the return value means. > +static void kill_task_isolation_strict_task(void) > +{ > + /* RCU should have been enabled prior to this point. */ > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); > + > + dump_stack(); > + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; > + send_sig(SIGKILL, current, 1); > +} Wasn't this supposed to be configurable? Or is that something that happens later on in the series? > + > +/* > + * This routine is called from syscall entry (with the syscall number > + * passed in) if the _STRICT flag is set. > + */ > +void task_isolation_syscall(int syscall) > +{ > + /* Ignore prctl() syscalls or any task exit. */ > + switch (syscall) { > + case __NR_prctl: > + case __NR_exit: > + case __NR_exit_group: > + return; > + } > + > + pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", > + current->comm, current->pid, syscall); > + kill_task_isolation_strict_task(); > +} Ick. I guess it works, but this is still quite ugly IMO. > +void task_isolation_exception(void) > +{ > + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", > + current->comm, current->pid); > + kill_task_isolation_strict_task(); > +} Should this say what exception? --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-28 20:51 ` Andy Lutomirski (?) @ 2015-09-28 21:54 ` Chris Metcalf 2015-09-28 22:38 ` Andy Lutomirski -1 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 09/28/2015 04:51 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> With task_isolation mode, the task is in principle guaranteed not to >> be interrupted by the kernel, but only if it behaves. In particular, >> if it enters the kernel via system call, page fault, or any of a >> number of other synchronous traps, it may be unexpectedly exposed >> to long latencies. Add a simple flag that puts the process into >> a state where any such kernel entry is fatal; this is defined as >> happening immediately after the SECCOMP test. > Why after seccomp? Seccomp is still an entry, and the code would be > considerably simpler if it were before seccomp. I could be convinced to do it either way. My initial thinking was that a security violation was more interesting and more important to report than a strict-mode task-isolation violation. But see my comments in response to your email on patch 07/11. >> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) >> return 0; >> >> prev_ctx = this_cpu_read(context_tracking.state); >> - if (prev_ctx != CONTEXT_KERNEL) >> - context_tracking_exit(prev_ctx); >> + if (prev_ctx != CONTEXT_KERNEL) { >> + if (context_tracking_exit(prev_ctx)) { >> + if (task_isolation_strict()) >> + task_isolation_exception(); >> + } >> + } >> >> return prev_ctx; >> } > x86 does not promise to call this function. 
In fact, x86 is rather > likely to stop ever calling this function in the reasonably near > future. Yes, in which case we'd have to do it the same way we are doing it for arm64 (see patch 09/11), by calling task_isolation_exception() explicitly from within the relevant exception handlers. If we start doing that, it's probably worth wrapping up the logic into a single inline function to keep the added code short and sweet. If in fact this might happen in the short term, it might be a good idea to hook the individual exception handlers in x86 now, and not hook the exception_enter() mechanism at all. >> --- a/kernel/context_tracking.c >> +++ b/kernel/context_tracking.c >> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); >> * This call supports re-entrancy. This way it can be called from any exception >> * handler without needing to know if we came from userspace or not. >> */ >> -void context_tracking_exit(enum ctx_state state) >> +bool context_tracking_exit(enum ctx_state state) > This needs clear documentation of what the return value means. Added: * Return: if called with state == CONTEXT_USER, the function returns * true if we were in fact previously in user mode. >> +static void kill_task_isolation_strict_task(void) >> +{ >> + /* RCU should have been enabled prior to this point. */ >> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); >> + >> + dump_stack(); >> + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; >> + send_sig(SIGKILL, current, 1); >> +} > Wasn't this supposed to be configurable? Or is that something that > happens later on in the series? Yup, next patch. >> +void task_isolation_exception(void) >> +{ >> + pr_warn("%s/%d: task_isolation strict mode violated by exception\n", >> + current->comm, current->pid); >> + kill_task_isolation_strict_task(); >> +} > Should this say what exception? I could modify it to take a string argument (and then use it for the arm64 case at least). 
For the exception_enter() caller, we actually don't have the information available to pass down, and it would be hard to get it. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-28 21:54 ` Chris Metcalf @ 2015-09-28 22:38 ` Andy Lutomirski 2015-09-29 17:35 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 22:38 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/28/2015 04:51 PM, Andy Lutomirski wrote: >> >> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> >>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) >>> return 0; >>> >>> prev_ctx = this_cpu_read(context_tracking.state); >>> - if (prev_ctx != CONTEXT_KERNEL) >>> - context_tracking_exit(prev_ctx); >>> + if (prev_ctx != CONTEXT_KERNEL) { >>> + if (context_tracking_exit(prev_ctx)) { >>> + if (task_isolation_strict()) >>> + task_isolation_exception(); >>> + } >>> + } >>> >>> return prev_ctx; >>> } >> >> x86 does not promise to call this function. In fact, x86 is rather >> likely to stop ever calling this function in the reasonably near >> future. > > > Yes, in which case we'd have to do it the same way we are doing > it for arm64 (see patch 09/11), by calling task_isolation_exception() > explicitly from within the relevant exception handlers. If we start > doing that, it's probably worth wrapping up the logic into a single > inline function to keep the added code short and sweet. > > If in fact this might happen in the short term, it might be a good > idea to hook the individual exception handlers in x86 now, and not > hook the exception_enter() mechanism at all. It's already like that in Linus' tree. 
FWIW, most of those exception handlers send signals, so it might pay to do it in notify_die or die instead. > >>> --- a/kernel/context_tracking.c >>> +++ b/kernel/context_tracking.c >>> @@ -144,15 +144,16 @@ NOKPROBE_SYMBOL(context_tracking_user_enter); >>> * This call supports re-entrancy. This way it can be called from any >>> exception >>> * handler without needing to know if we came from userspace or not. >>> */ >>> -void context_tracking_exit(enum ctx_state state) >>> +bool context_tracking_exit(enum ctx_state state) >> >> This needs clear documentation of what the return value means. > > > Added: > > * Return: if called with state == CONTEXT_USER, the function returns > * true if we were in fact previously in user mode. This should note that it only returns true if context tracking is on. >>> +void task_isolation_exception(void) >>> +{ >>> + pr_warn("%s/%d: task_isolation strict mode violated by >>> exception\n", >>> + current->comm, current->pid); >>> + kill_task_isolation_strict_task(); >>> +} >> >> Should this say what exception? > > > I could modify it to take a string argument (and then use it for > the arm64 case at least). For the exception_enter() caller, we actually > don't have the information available to pass down, and it would > be hard to get it. For x86, the relevant info might be the actual hw error number (error_code, which makes it into die) or the signal. If we send a death signal, then reporting the error number the usual way might make sense. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
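The return-value semantics being negotiated above (true only when context tracking is on and the CPU really was in user mode, false on re-entrant calls) can be captured in a small single-CPU model. This is a hypothetical sketch, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CONTEXT_KERNEL, CONTEXT_USER };

/* Hypothetical single-CPU model of the bool-returning context_tracking_exit():
 * per the documentation discussed above, it returns true only when context
 * tracking is active *and* the CPU really was in the state being exited. */
static bool tracking_active = true;
static enum ctx_state cur_state = CONTEXT_USER;

static bool model_context_tracking_exit(enum ctx_state state)
{
	if (!tracking_active)
		return false;		/* tracking off: nothing to report */
	if (cur_state != state)
		return false;		/* re-entrant call: already exited */
	cur_state = CONTEXT_KERNEL;
	return state == CONTEXT_USER;
}
```

The re-entrancy case is why exception_enter() can call this from any handler without knowing where it came from.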
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-28 22:38 ` Andy Lutomirski @ 2015-09-29 17:35 ` Chris Metcalf 2015-09-29 17:46 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-29 17:35 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 09/28/2015 06:38 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> On 09/28/2015 04:51 PM, Andy Lutomirski wrote: >>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> >>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) >>>> return 0; >>>> >>>> prev_ctx = this_cpu_read(context_tracking.state); >>>> - if (prev_ctx != CONTEXT_KERNEL) >>>> - context_tracking_exit(prev_ctx); >>>> + if (prev_ctx != CONTEXT_KERNEL) { >>>> + if (context_tracking_exit(prev_ctx)) { >>>> + if (task_isolation_strict()) >>>> + task_isolation_exception(); >>>> + } >>>> + } >>>> >>>> return prev_ctx; >>>> } >>> x86 does not promise to call this function. In fact, x86 is rather >>> likely to stop ever calling this function in the reasonably near >>> future. >> >> Yes, in which case we'd have to do it the same way we are doing >> it for arm64 (see patch 09/11), by calling task_isolation_exception() >> explicitly from within the relevant exception handlers. If we start >> doing that, it's probably worth wrapping up the logic into a single >> inline function to keep the added code short and sweet. >> >> If in fact this might happen in the short term, it might be a good >> idea to hook the individual exception handlers in x86 now, and not >> hook the exception_enter() mechanism at all. > It's already like that in Linus' tree. 
OK, I will restructure so that it doesn't rely on the context_tracking code at all, but instead requires a line of code in every relevant kernel exception handler. > FWIW, most of those exception handlers send signals, so it might pay > to do it in notify_die or die instead. Well, the most interesting category is things that don't actually trigger a signal (e.g. minor page fault) since those are things that cause significant issues with task isolation processes (kernel-induced jitter) but aren't otherwise user-visible, much like an undiscovered syscall in a third-party library can cause unexpected jitter. > For x86, the relevant info might be the actual hw error number > (error_code, which makes it into die) or the signal. If we send a > death signal, then reporting the error number the usual way might make > sense. I may just choose to use a task_isolation_exception(fmt, ...) signature so that code can printk a suitable one-liner before delivering the SIGKILL (or whatever signal was configured). -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
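The `task_isolation_exception(fmt, ...)` signature Chris proposes could look roughly like the following. This is a hedged userspace sketch: the kernel version would use pr_warn() and deliver the configured signal, while here the formatted report is captured in a buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shape of the printf-style helper proposed above: format a
 * one-line reason, then (in the real kernel) printk it and deliver the
 * configured signal.  Here the report is captured for inspection instead. */
static char last_report[128];

static void task_isolation_exception(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(last_report, sizeof(last_report), fmt, ap);
	va_end(ap);
	/* real code: pr_warn("%s/%d: ... %s\n", comm, pid, last_report);
	 * then send the configured signal to current */
}
```

A caller that does have details available, such as an arm64 fault handler, could then pass the fault address or error code down.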
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-09-29 17:46 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-09-29 17:46 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/28/2015 06:38 PM, Andy Lutomirski wrote: >> >> On Mon, Sep 28, 2015 at 2:54 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> On 09/28/2015 04:51 PM, Andy Lutomirski wrote: >>>> >>>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> >>>>> >>>>> @@ -35,8 +36,12 @@ static inline enum ctx_state exception_enter(void) >>>>> return 0; >>>>> >>>>> prev_ctx = this_cpu_read(context_tracking.state); >>>>> - if (prev_ctx != CONTEXT_KERNEL) >>>>> - context_tracking_exit(prev_ctx); >>>>> + if (prev_ctx != CONTEXT_KERNEL) { >>>>> + if (context_tracking_exit(prev_ctx)) { >>>>> + if (task_isolation_strict()) >>>>> + task_isolation_exception(); >>>>> + } >>>>> + } >>>>> >>>>> return prev_ctx; >>>>> } >>>> >>>> x86 does not promise to call this function. In fact, x86 is rather >>>> likely to stop ever calling this function in the reasonably near >>>> future. >>> >>> >>> Yes, in which case we'd have to do it the same way we are doing >>> it for arm64 (see patch 09/11), by calling task_isolation_exception() >>> explicitly from within the relevant exception handlers. If we start >>> doing that, it's probably worth wrapping up the logic into a single >>> inline function to keep the added code short and sweet. 
>>> >>> If in fact this might happen in the short term, it might be a good >>> idea to hook the individual exception handlers in x86 now, and not >>> hook the exception_enter() mechanism at all. >> >> It's already like that in Linus' tree. > > > OK, I will restructure so that it doesn't rely on the context_tracking > code at all, but instead requires a line of code in every relevant > kernel exception handler. > >> FWIW, most of those exception handlers send signals, so it might pay >> to do it in notify_die or die instead. > > > Well, the most interesting category is things that don't actually > trigger a signal (e.g. minor page fault) since those are things that > cause significant issues with task isolation processes > (kernel-induced jitter) but aren't otherwise user-visible, > much like an undiscovered syscall in a third-party library > can cause unexpected jitter. Would it make sense to exempt the exceptions that result in signals? After all, those are detectable even without your patches. Going through all of the exception types: divide_error, overflow, invalid_op, coprocessor_segment_overrun, invalid_TSS, segment_not_present, stack_segment, alignment_check: these all send signals anyway. double_fault is fatal. bounds: MPX faults can be silently fixed up, and those will need notification. (Or user code should know not to do that, since it requires an explicit opt in, and user code can flip it back off to get the signals.) general_protection: always signals except in vm86 mode. int3: silently fixed if uprobes are in use, but I don't think isolation cares about that. Otherwise signals. debug: The perf hw_breakpoint can result in silent fixups, but those require explicit opt-in from the admin. Otherwise, unless there's a bug or a debugger, the user will get a signal. (As a practical matter, the only interesting case is the undocumented ICEBP instruction.) math_error, simd_coprocessor_error: Sends a signal. 
spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK. We should just WARN if this hits. device_not_available: If you're using isolation without an FPU, you have bigger problems. page_fault: Needs notification. NMI, MCE: arguably these should *not* notify or at least not fatally. So maybe a better approach would be to explicitly notify for the relevant entries: IRQs, non-signalling page faults, and non-signalling MPX fixups. Other arches would have their own lists, but they're probably also short except for emulated instructions. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
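Andy's survey amounts to a per-architecture classification: only entries that can complete without sending a signal need explicit task-isolation notification. A table-driven sketch of that idea, with illustrative names rather than the kernel's actual handler symbols, might look like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table encoding the survey above: only exception entries
 * that can complete without a signal need explicit task-isolation
 * notification.  Names are illustrative, not the kernel's. */
struct exc_class {
	const char *name;
	bool needs_notify;
};

static const struct exc_class x86_exceptions[] = {
	{ "divide_error", false },	/* sends a signal anyway */
	{ "double_fault", false },	/* fatal */
	{ "bounds",       true  },	/* MPX faults may be fixed up silently */
	{ "page_fault",   true  },	/* minor faults are invisible jitter */
	{ "nmi",          false },	/* arguably should not notify fatally */
};

static bool exc_needs_notify(const char *name)
{
	for (size_t i = 0;
	     i < sizeof(x86_exceptions) / sizeof(x86_exceptions[0]); i++)
		if (strcmp(x86_exceptions[i].name, name) == 0)
			return x86_exceptions[i].needs_notify;
	return false;
}
```

Other architectures would carry their own tables, which as noted are probably short outside of emulated instructions.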
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-09-29 17:57 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-29 17:57 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 09/29/2015 01:46 PM, Andy Lutomirski wrote: > On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Well, the most interesting category is things that don't actually >> trigger a signal (e.g. minor page fault) since those are things that >> cause significant issues with task isolation processes >> (kernel-induced jitter) but aren't otherwise user-visible, >> much like an undiscovered syscall in a third-party library >> can cause unexpected jitter. > Would it make sense to exempt the exceptions that result in signals? > After all, those are detectable even without your patches. Going > through all of the exception types: > > divide_error, overflow, invalid_op, coprocessor_segment_overrun, > invalid_TSS, segment_not_present, stack_segment, alignment_check: > these all send signals anyway. > > double_fault is fatal. > > bounds: MPX faults can be silently fixed up, and those will need > notification. (Or user code should know not to do that, since it > requires an explicit opt in, and user code can flip it back off to get > the signals.) > > general_protection: always signals except in vm86 mode. > > int3: silently fixed if uprobes are in use, but I don't think > isolation cares about that. Otherwise signals. > > debug: The perf hw_breakpoint can result in silent fixups, but those > require explicit opt-in from the admin. Otherwise, unless there's a > bug or a debugger, the user will get a signal. 
(As a practical > matter, the only interesting case is the undocumented ICEBP > instruction.) > > math_error, simd_coprocessor_error: Sends a signal. > > spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK. We should > just WARN if this hits. > > device_not_available: If you're using isolation without an FPU, you > have bigger problems. > > page_fault: Needs notification. > > NMI, MCE: arguably these should *not* notify or at least not fatally. > > So maybe a better approach would be to explicitly notify for the > relevant entries: IRQs, non-signalling page faults, and non-signalling > MPX fixups. Other arches would have their own lists, but they're > probably also short except for emulated instructions. IRQs should get notified via the task_isolation_debug boot flag; the intent is that they should never get delivered to nohz_full cores anyway, so we produce a console backtrace if the boot flag is enabled. This isn't tied to having a task running with TASK_ISOLATION enabled, since it just shouldn't ever happen. Thanks for reviewing the possible exception sources on x86, which I'm less familiar with than tile. Non-signalling page faults and MPX fixups sounds exactly right - and I didn't know about MPX before your email (other than the userspace side of the notion of bounds registers), so thanks for the pointer. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
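The task_isolation_debug behavior described above, reporting any interrupt that reaches a nohz_full core regardless of whether an isolated task is running there, can be modeled as a simple gated diagnostic. This is a hypothetical sketch; the real code would dump a kernel backtrace rather than write to stderr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the task_isolation_debug gating: any interrupt
 * delivered to a nohz_full core produces a diagnostic when the boot flag
 * is set, independent of whether a TASK_ISOLATION task runs there. */
static bool task_isolation_debug;	/* set from the boot command line */
static int debug_reports;

static void task_isolation_debug_cpu(int cpu, bool cpu_is_nohz_full)
{
	if (!task_isolation_debug || !cpu_is_nohz_full)
		return;
	debug_reports++;	/* real code would printk a backtrace */
	fprintf(stderr, "interrupt on nohz_full cpu %d\n", cpu);
}
```

Decoupling the check from the current task's flags is what makes it useful for catching "shouldn't ever happen" deliveries.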
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-09-29 17:57 ` Chris Metcalf (?) @ 2015-09-29 18:00 ` Andy Lutomirski 2015-10-01 19:25 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-29 18:00 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/29/2015 01:46 PM, Andy Lutomirski wrote: >> >> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> Well, the most interesting category is things that don't actually >>> trigger a signal (e.g. minor page fault) since those are things that >>> cause significant issues with task isolation processes >>> (kernel-induced jitter) but aren't otherwise user-visible, >>> much like an undiscovered syscall in a third-party library >>> can cause unexpected jitter. >> >> Would it make sense to exempt the exceptions that result in signals? >> After all, those are detectable even without your patches. Going >> through all of the exception types: >> >> divide_error, overflow, invalid_op, coprocessor_segment_overrun, >> invalid_TSS, segment_not_present, stack_segment, alignment_check: >> these all send signals anyway. >> >> double_fault is fatal. >> >> bounds: MPX faults can be silently fixed up, and those will need >> notification. (Or user code should know not to do that, since it >> requires an explicit opt in, and user code can flip it back off to get >> the signals.) >> >> general_protection: always signals except in vm86 mode. >> >> int3: silently fixed if uprobes are in use, but I don't think >> isolation cares about that. Otherwise signals. 
>> >> debug: The perf hw_breakpoint can result in silent fixups, but those >> require explicit opt-in from the admin. Otherwise, unless there's a >> bug or a debugger, the user will get a signal. (As a practical >> matter, the only interesting case is the undocumented ICEBP >> instruction.) >> >> math_error, simd_coprocessor_error: Sends a signal. >> >> spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK. We should >> just WARN if this hits. >> >> device_not_available: If you're using isolation without an FPU, you >> have bigger problems. >> >> page_fault: Needs notification. >> >> NMI, MCE: arguably these should *not* notify or at least not fatally. >> >> So maybe a better approach would be to explicitly notify for the >> relevant entries: IRQs, non-signalling page faults, and non-signalling >> MPX fixups. Other arches would have their own lists, but they're >> probably also short except for emulated instructions. > > > IRQs should get notified via the task_isolation_debug boot flag; > the intent is that they should never get delivered to nohz_full > cores anyway, so we produce a console backtrace if the boot > flag is enabled. This isn't tied to having a task running with > TASK_ISOLATION enabled, since it just shouldn't ever happen. OK, I like that. In that case, maybe NMI and MCE should be in a similar category. (IOW if a non-fatal MCE happens and the debug param is set, we could warn, assuming that anyone is willing to write the code. Doing printk from MCE is not entirely trivial, although it's less bad in recent kernels.) --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 03/11] task_isolation: support PR_TASK_ISOLATION_STRICT mode @ 2015-10-01 19:25 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 09/29/2015 02:00 PM, Andy Lutomirski wrote: > On Tue, Sep 29, 2015 at 10:57 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> On 09/29/2015 01:46 PM, Andy Lutomirski wrote: >>> On Tue, Sep 29, 2015 at 10:35 AM, Chris Metcalf <cmetcalf@ezchip.com> >>> wrote: >>>> Well, the most interesting category is things that don't actually >>>> trigger a signal (e.g. minor page fault) since those are things that >>>> cause significant issues with task isolation processes >>>> (kernel-induced jitter) but aren't otherwise user-visible, >>>> much like an undiscovered syscall in a third-party library >>>> can cause unexpected jitter. >>> Would it make sense to exempt the exceptions that result in signals? >>> After all, those are detectable even without your patches. Going >>> through all of the exception types: >>> >>> divide_error, overflow, invalid_op, coprocessor_segment_overrun, >>> invalid_TSS, segment_not_present, stack_segment, alignment_check: >>> these all send signals anyway. >>> >>> double_fault is fatal. >>> >>> bounds: MPX faults can be silently fixed up, and those will need >>> notification. (Or user code should know not to do that, since it >>> requires an explicit opt in, and user code can flip it back off to get >>> the signals.) >>> >>> general_protection: always signals except in vm86 mode. >>> >>> int3: silently fixed if uprobes are in use, but I don't think >>> isolation cares about that. Otherwise signals. 
>>> >>> debug: The perf hw_breakpoint can result in silent fixups, but those >>> require explicit opt-in from the admin. Otherwise, unless there's a >>> bug or a debugger, the user will get a signal. (As a practical >>> matter, the only interesting case is the undocumented ICEBP >>> instruction.) >>> >>> math_error, simd_coprocessor_error: Sends a signal. >>> >>> spurious_interrupt_bug: Irrelevant on any modern CPU AFAIK. We should >>> just WARN if this hits. >>> >>> device_not_available: If you're using isolation without an FPU, you >>> have bigger problems. >>> >>> page_fault: Needs notification. >>> >>> NMI, MCE: arguably these should *not* notify or at least not fatally. >>> >>> So maybe a better approach would be to explicitly notify for the >>> relevant entries: IRQs, non-signalling page faults, and non-signalling >>> MPX fixups. Other arches would have their own lists, but they're >>> probably also short except for emulated instructions. >> >> IRQs should get notified via the task_isolation_debug boot flag; >> the intent is that they should never get delivered to nohz_full >> cores anyway, so we produce a console backtrace if the boot >> flag is enabled. This isn't tied to having a task running with >> TASK_ISOLATION enabled, since it just shouldn't ever happen. > OK, I like that. In that case, maybe NMI and MCE should be in a > similar category. (IOW if a non-fatal MCE happens and the debug param > is set, we could warn, assuming that anyone is willing to write the > code. Doing printk from MCE is not entirely trivial, although it's > less bad in recent kernels.) For now I will stay away from tampering with the NMI/MCE handlers, though if it turns out that it's the cause of mysterious latencies in task-isolation applications in the future, it will likely make sense to add some debugging there. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 04/11] task_isolation: provide strict mode configurable signal 2015-09-28 15:17 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a task_isolation process in STRICT mode does a syscall or otherwise synchronously enters the kernel. In addition to being able to set the signal, we now also pass whether or not the interruption was from a syscall in the si_code field of the siginfo. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/isolation.c | 17 +++++++++++++---- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 2b8038b0d1e1..a5582ace987f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -202,5 +202,7 @@ struct prctl_mm_map { #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) # define PR_TASK_ISOLATION_STRICT (1 << 1) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index 3779ba670472..44bafcd08bca 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -77,14 +77,23 @@ void task_isolation_enter(void) } } -static void kill_task_isolation_strict_task(void) +static void kill_task_isolation_strict_task(int is_syscall) { + siginfo_t info = {}; + int sig; + /* RCU should have been enabled prior to this point. 
*/ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); dump_stack(); current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); + if (sig == 0) + sig = SIGKILL; + info.si_signo = sig; + info.si_code = is_syscall; + send_sig_info(sig, &info, current); } /* @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall) pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", current->comm, current->pid, syscall); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(1); } /* @@ -114,5 +123,5 @@ void task_isolation_exception(void) { pr_warn("%s/%d: task_isolation strict mode violated by exception\n", current->comm, current->pid); - kill_task_isolation_strict_task(); + kill_task_isolation_strict_task(0); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal 2015-09-28 15:17 ` Chris Metcalf (?) @ 2015-09-28 20:54 ` Andy Lutomirski 2015-09-28 21:54 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:54 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > > In addition to being able to set the signal, we now also > pass whether or not the interruption was from a syscall in > the si_code field of the siginfo. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > include/uapi/linux/prctl.h | 2 ++ > kernel/isolation.c | 17 +++++++++++++---- > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 2b8038b0d1e1..a5582ace987f 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -202,5 +202,7 @@ struct prctl_mm_map { > #define PR_GET_TASK_ISOLATION 49 > # define PR_TASK_ISOLATION_ENABLE (1 << 0) > # define PR_TASK_ISOLATION_STRICT (1 << 1) > +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) > +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) > > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/isolation.c b/kernel/isolation.c > index 3779ba670472..44bafcd08bca 100644 > --- a/kernel/isolation.c > +++ b/kernel/isolation.c > @@ -77,14 +77,23 @@ void task_isolation_enter(void) > } > } > > -static void kill_task_isolation_strict_task(void) > +static void 
kill_task_isolation_strict_task(int is_syscall) > { > + siginfo_t info = {}; > + int sig; > + > /* RCU should have been enabled prior to this point. */ > RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); > > dump_stack(); > current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; > - send_sig(SIGKILL, current, 1); > + > + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); > + if (sig == 0) > + sig = SIGKILL; > + info.si_signo = sig; > + info.si_code = is_syscall; I think this needs real SI_ defines. > + send_sig_info(sig, &info, current); > } > > /* > @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall) > > pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", > current->comm, current->pid, syscall); > - kill_task_isolation_strict_task(); > + kill_task_isolation_strict_task(1); No magic numbers please. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 04/11] task_isolation: provide strict mode configurable signal @ 2015-09-28 21:54 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 21:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 09/28/2015 04:54 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> Allow userspace to override the default SIGKILL delivered >> when a task_isolation process in STRICT mode does a syscall >> or otherwise synchronously enters the kernel. >> >> In addition to being able to set the signal, we now also >> pass whether or not the interruption was from a syscall in >> the si_code field of the siginfo. >> >> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> >> --- >> include/uapi/linux/prctl.h | 2 ++ >> kernel/isolation.c | 17 +++++++++++++---- >> 2 files changed, 15 insertions(+), 4 deletions(-) >> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index 2b8038b0d1e1..a5582ace987f 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -202,5 +202,7 @@ struct prctl_mm_map { >> #define PR_GET_TASK_ISOLATION 49 >> # define PR_TASK_ISOLATION_ENABLE (1 << 0) >> # define PR_TASK_ISOLATION_STRICT (1 << 1) >> +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) >> +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) >> >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/kernel/isolation.c b/kernel/isolation.c >> index 3779ba670472..44bafcd08bca 100644 >> --- a/kernel/isolation.c >> +++ b/kernel/isolation.c >> @@ -77,14 +77,23 @@ void task_isolation_enter(void) >> } >> } >> >> -static void kill_task_isolation_strict_task(void) >> +static void 
kill_task_isolation_strict_task(int is_syscall) >> { >> + siginfo_t info = {}; >> + int sig; >> + >> /* RCU should have been enabled prior to this point. */ >> RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); >> >> dump_stack(); >> current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; >> - send_sig(SIGKILL, current, 1); >> + >> + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); >> + if (sig == 0) >> + sig = SIGKILL; >> + info.si_signo = sig; >> + info.si_code = is_syscall; > I think this needs real SI_ defines. Yeah, it's a fair point, but of course SIGKILL has no SI_ defines at all right now. I'm tempted to suggest we just back out setting si_code altogether. It might be worth a one-line console message (a la show_signal_message()), and use that to pack in the extra information, instead of trying to fuss with the siginfo data. >> + send_sig_info(sig, &info, current); >> } >> >> /* >> @@ -103,7 +112,7 @@ void task_isolation_syscall(int syscall) >> >> pr_warn("%s/%d: task_isolation strict mode violated by syscall %d\n", >> current->comm, current->pid, syscall); >> - kill_task_isolation_strict_task(); >> + kill_task_isolation_strict_task(1); > No magic numbers please. I think mooted by the above, but, good point. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 15:17 ` Chris Metcalf ` (4 preceding siblings ...) (?) @ 2015-09-28 15:17 ` Chris Metcalf 2015-09-28 20:59 ` Andy Lutomirski 2015-10-05 17:07 ` Luiz Capitulino -1 siblings, 2 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel Cc: Chris Metcalf The new "task_isolation_debug" flag simplifies debugging of TASK_ISOLATION kernels when processes are running in PR_TASK_ISOLATION_ENABLE mode. Such processes should get no interrupts from the kernel, and if they do, when this boot flag is specified a kernel stack dump on the console is generated. It's possible to use ftrace to simply detect whether a task_isolation core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a task_isolation core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 7 +++++++ include/linux/isolation.h | 2 ++ kernel/irq_work.c | 5 ++++- kernel/sched/core.c | 21 +++++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 7 +++++++ 7 files changed, 50 insertions(+), 1 deletion(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 22a4b687ea5b..48ff15f3166f 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3623,6 +3623,13 @@ bytes respectively. 
Such letter suffixes can also be entirely omitted. neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation_debug [KNL] + In kernels built with CONFIG_TASK_ISOLATION and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_TASK_ISOLATION_ENABLE + and is running on a nohz_full core. + tcpmhash_entries= [KNL,NET] Set the number of tcp_metrics_hash slots. Default value is 8192 or 16384 depending on total diff --git a/include/linux/isolation.h b/include/linux/isolation.h index 27a4469831c1..9f1747331a36 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -18,11 +18,13 @@ extern void task_isolation_enter(void); extern void task_isolation_syscall(int nr); extern void task_isolation_exception(void); extern void task_isolation_wait(void); +extern void task_isolation_debug(int cpu); #else static inline bool task_isolation_enabled(void) { return false; } static inline void task_isolation_enter(void) { } static inline void task_isolation_syscall(int nr) { } static inline void task_isolation_exception(void) { } +static inline void task_isolation_debug(int cpu) { } #endif static inline bool task_isolation_strict(void) diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..745c2ea6a4e4 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -17,6 +17,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/smp.h> +#include <linux/isolation.h> #include <asm/processor.h> @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + task_isolation_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3595403921bd..8ddabb0d7510 100644 --- 
a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -74,6 +74,7 @@ #include <linux/binfmts.h> #include <linux/context_tracking.h> #include <linux/compiler.h> +#include <linux/isolation.h> #include <asm/switch_to.h> #include <asm/tlb.h> @@ -743,6 +744,26 @@ bool sched_can_stop_tick(void) } #endif /* CONFIG_NO_HZ_FULL */ +#ifdef CONFIG_TASK_ISOLATION +/* Enable debugging of any interrupts of task_isolation cores. */ +static int task_isolation_debug_flag; +static int __init task_isolation_debug_func(char *str) +{ + task_isolation_debug_flag = true; + return 1; +} +__setup("task_isolation_debug", task_isolation_debug_func); + +void task_isolation_debug(int cpu) +{ + if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) { + pr_err("Interrupt detected for task_isolation cpu %d\n", cpu); + dump_stack(); + } +} +#endif + void sched_avg_update(struct rq *rq) { s64 period = sched_avg_period(); diff --git a/kernel/signal.c b/kernel/signal.c index 0f6bbbe77b46..c6e09f0f7e24 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_TASK_ISOLATION + /* If the task is being killed, don't complain about task_isolation. */ + if (state & TASK_WAKEKILL) + t->task_isolation_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..b0bddff2693d 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/isolation.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. 
Generic code isn't really * equipped to do the right thing... */ + task_isolation_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + task_isolation_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..ed762fec7265 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,8 +24,10 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -335,6 +337,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + task_isolation_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
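Putting the pieces of this patch together, enabling the diagnostics is just a matter of adding the new flag alongside nohz_full= on the kernel command line (the CPU range below is illustrative, not from the patch):

```
nohz_full=1-3 task_isolation_debug
```

With that configuration, any IPI, irq_work, or irq_enter() targeting a nohz_full core whose current task has set PR_TASK_ISOLATION_ENABLE produces a console backtrace at the point where the interrupt is generated.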
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 15:17 ` [PATCH v7 05/11] task_isolation: add debug boot flag Chris Metcalf @ 2015-09-28 20:59 ` Andy Lutomirski 2015-09-28 21:55 ` Chris Metcalf 2015-10-05 17:07 ` Luiz Capitulino 1 sibling, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:59 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > The new "task_isolation_debug" flag simplifies debugging > of TASK_ISOLATION kernels when processes are running in > PR_TASK_ISOLATION_ENABLE mode. Such processes should get no > interrupts from the kernel, and if they do, when this boot flag is > specified a kernel stack dump on the console is generated. > > It's possible to use ftrace to simply detect whether a task_isolation > core has unexpectedly entered the kernel. But what this boot flag > does is allow the kernel to provide better diagnostics, e.g. by > reporting in the IPI-generating code what remote core and context > is preparing to deliver an interrupt to a task_isolation core. > > It may be worth considering other ways to generate useful debugging > output rather than console spew, but for now that is simple and direct. This may be addressed elsewhere, but is there anything that alerts the task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a nohz_full core? --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 20:59 ` Andy Lutomirski @ 2015-09-28 21:55 ` Chris Metcalf 2015-09-28 22:40 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 21:55 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel On 09/28/2015 04:59 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> The new "task_isolation_debug" flag simplifies debugging >> of TASK_ISOLATION kernels when processes are running in >> PR_TASK_ISOLATION_ENABLE mode. Such processes should get no >> interrupts from the kernel, and if they do, when this boot flag is >> specified a kernel stack dump on the console is generated. >> >> It's possible to use ftrace to simply detect whether a task_isolation >> core has unexpectedly entered the kernel. But what this boot flag >> does is allow the kernel to provide better diagnostics, e.g. by >> reporting in the IPI-generating code what remote core and context >> is preparing to deliver an interrupt to a task_isolation core. >> >> It may be worth considering other ways to generate useful debugging >> output rather than console spew, but for now that is simple and direct. > This may be addressed elsewhere, but is there anything that alerts the > task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a > nohz_full core? No, and I've thought about it without coming up with a great solution. We could certainly fail the initial prctl() if the caller was not on a nohz_full core. But this seems a little asymmetric since the task could be on such a core at prctl() time, and then do a sched_setaffinity() later to a non-nohz-full core. Would we want to fail that call? 
Seems heavy-handed. Or we could then clear the task-isolation state and emit a console message. I suppose we could notice that we were on a nohz-full enabled system and the task isolation flags were set on return to userspace, but we were not on a nohz-full core, and emit a console message and clear the task-isolation state at that point. But that also seems a little questionable; maybe the user for some reason was doing some odd migratory thing with their tasks or threads and was going to end up migrating them to a final destination where the prctl() would apply. Any suggestions for a better approach? Is it worth doing the minimal printk-warning approach in the previous paragraph? My instinct is to say that we just leave it as-is, I think. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 21:55 ` Chris Metcalf @ 2015-09-28 22:40 ` Andy Lutomirski 2015-09-29 17:35 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 22:40 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel On Mon, Sep 28, 2015 at 2:55 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/28/2015 04:59 PM, Andy Lutomirski wrote: >> >> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> The new "task_isolation_debug" flag simplifies debugging >>> of TASK_ISOLATION kernels when processes are running in >>> PR_TASK_ISOLATION_ENABLE mode. Such processes should get no >>> interrupts from the kernel, and if they do, when this boot flag is >>> specified a kernel stack dump on the console is generated. >>> >>> It's possible to use ftrace to simply detect whether a task_isolation >>> core has unexpectedly entered the kernel. But what this boot flag >>> does is allow the kernel to provide better diagnostics, e.g. by >>> reporting in the IPI-generating code what remote core and context >>> is preparing to deliver an interrupt to a task_isolation core. >>> >>> It may be worth considering other ways to generate useful debugging >>> output rather than console spew, but for now that is simple and direct. >> >> This may be addressed elsewhere, but is there anything that alerts the >> task or the admin if it's PR_TASK_ISOLATION_ENABLE and *not* on a >> nohz_full core? > > > No, and I've thought about it without coming up with a great > solution. We could certainly fail the initial prctl() if the caller > was not on a nohz_full core. 
But this seems a little asymmetric > since the task could be on such a core at prctl() time, and then > do a sched_setaffinity() later to a non-nohz-full core. Would > we want to fail that call? Seems heavy-handed. Or we could > then clear the task-isolation state and emit a console message. If I were writing a program that used this feature, I think I'd want to know early that it's not going to work so I can tell the admin very loudly to fix it. Maybe just failing the prctl would be enough. If someone turns on the prctl and then changes their affinity, maybe we should treat it as though they're just asking for trouble and allow it. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 22:40 ` Andy Lutomirski @ 2015-09-29 17:35 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-29 17:35 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, linux-kernel On 09/28/2015 06:40 PM, Andy Lutomirski wrote: > If I were writing a program that used this feature, I think I'd want > to know early that it's not going to work so I can tell the admin very > loudly to fix it. Maybe just failing the prctl would be enough. If > someone turns on the prctl and then changes their affinity, maybe we > should treat it as though they're just asking for trouble and allow > it. Yes, the original Tilera implementation required the task to be affinitized to a single, nohz_full cpu before you could call the prctl() successfully. I think I will re-instate that for the patch series, and if the user then re-affinitizes to a non-nohz_full core later, well, "don't do that then". -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
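The check Chris says he will re-instate — fail the prctl() unless the caller is affinitized to exactly one CPU and that CPU is nohz_full — can be modeled in plain userspace C. This is only an illustrative sketch of the policy: the `uint64_t` mask stands in for the kernel's cpumask_t, and `cpu_is_nohz_full()` / `task_isolation_prctl_check()` are made-up names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for tick_nohz_full_cpu(): is this CPU in the
 * nohz_full set?  A uint64_t bitmask models the kernel's cpumask_t. */
static bool cpu_is_nohz_full(int cpu, uint64_t nohz_full_mask)
{
    return (nohz_full_mask >> cpu) & 1;
}

/* Model of the proposed prctl-time validation: return 0 on success,
 * -1 (think -EINVAL) if the caller's affinity is unsuitable. */
static int task_isolation_prctl_check(uint64_t affinity_mask,
                                      uint64_t nohz_full_mask)
{
    /* Must be pinned to exactly one CPU (exactly one bit set). */
    if (affinity_mask == 0 || (affinity_mask & (affinity_mask - 1)) != 0)
        return -1;
    /* That single CPU must be a nohz_full core. */
    int cpu = __builtin_ctzll(affinity_mask);
    return cpu_is_nohz_full(cpu, nohz_full_mask) ? 0 : -1;
}
```

Under this model, a later sched_setaffinity() away from the nohz_full core is simply allowed — matching the "don't do that then" stance above.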
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-09-28 15:17 ` [PATCH v7 05/11] task_isolation: add debug boot flag Chris Metcalf 2015-09-28 20:59 ` Andy Lutomirski @ 2015-10-05 17:07 ` Luiz Capitulino 2015-10-08 0:33 ` Chris Metcalf 1 sibling, 1 reply; 340+ messages in thread From: Luiz Capitulino @ 2015-10-05 17:07 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Mon, 28 Sep 2015 11:17:20 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > The new "task_isolation_debug" flag simplifies debugging > of TASK_ISOLATION kernels when processes are running in > PR_TASK_ISOLATION_ENABLE mode. Such processes should get no > interrupts from the kernel, and if they do, when this boot flag is > specified a kernel stack dump on the console is generated. > > It's possible to use ftrace to simply detect whether a task_isolation > core has unexpectedly entered the kernel. But what this boot flag > does is allow the kernel to provide better diagnostics, e.g. by > reporting in the IPI-generating code what remote core and context > is preparing to deliver an interrupt to a task_isolation core. > > It may be worth considering other ways to generate useful debugging > output rather than console spew, but for now that is simple and direct. Honest question: does any of the task_isolation_debug() calls added by this patch take care of the case where vmstat_shepherd() may schedule vmstat_update() to run because a TASK_ISOLATION process is changing memory stats? If that's not taken care of yet, should we? I just don't know if we should call task_isolation_exception() or task_isolation_debug(). In the case of the latter, wouldn't it be interesting to add it to __queue_work() then? 
> > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > Documentation/kernel-parameters.txt | 7 +++++++ > include/linux/isolation.h | 2 ++ > kernel/irq_work.c | 5 ++++- > kernel/sched/core.c | 21 +++++++++++++++++++++ > kernel/signal.c | 5 +++++ > kernel/smp.c | 4 ++++ > kernel/softirq.c | 7 +++++++ > 7 files changed, 50 insertions(+), 1 deletion(-) > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt > index 22a4b687ea5b..48ff15f3166f 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -3623,6 +3623,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. > neutralize any effect of /proc/sys/kernel/sysrq. > Useful for debugging. > > + task_isolation_debug [KNL] > + In kernels built with CONFIG_TASK_ISOLATION and booted > + in nohz_full= mode, this setting will generate console > + backtraces when the kernel is about to interrupt a > + task that has requested PR_TASK_ISOLATION_ENABLE > + and is running on a nohz_full core. > + > tcpmhash_entries= [KNL,NET] > Set the number of tcp_metrics_hash slots. 
> Default value is 8192 or 16384 depending on total > diff --git a/include/linux/isolation.h b/include/linux/isolation.h > index 27a4469831c1..9f1747331a36 100644 > --- a/include/linux/isolation.h > +++ b/include/linux/isolation.h > @@ -18,11 +18,13 @@ extern void task_isolation_enter(void); > extern void task_isolation_syscall(int nr); > extern void task_isolation_exception(void); > extern void task_isolation_wait(void); > +extern void task_isolation_debug(int cpu); > #else > static inline bool task_isolation_enabled(void) { return false; } > static inline void task_isolation_enter(void) { } > static inline void task_isolation_syscall(int nr) { } > static inline void task_isolation_exception(void) { } > +static inline void task_isolation_debug(int cpu) { } > #endif > > static inline bool task_isolation_strict(void) > diff --git a/kernel/irq_work.c b/kernel/irq_work.c > index cbf9fb899d92..745c2ea6a4e4 100644 > --- a/kernel/irq_work.c > +++ b/kernel/irq_work.c > @@ -17,6 +17,7 @@ > #include <linux/cpu.h> > #include <linux/notifier.h> > #include <linux/smp.h> > +#include <linux/isolation.h> > #include <asm/processor.h> > > > @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) > if (!irq_work_claim(work)) > return false; > > - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) > + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { > + task_isolation_debug(cpu); > arch_send_call_function_single_ipi(cpu); > + } > > return true; > } > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 3595403921bd..8ddabb0d7510 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -74,6 +74,7 @@ > #include <linux/binfmts.h> > #include <linux/context_tracking.h> > #include <linux/compiler.h> > +#include <linux/isolation.h> > > #include <asm/switch_to.h> > #include <asm/tlb.h> > @@ -743,6 +744,26 @@ bool sched_can_stop_tick(void) > } > #endif /* CONFIG_NO_HZ_FULL */ > > +#ifdef CONFIG_TASK_ISOLATION > +/* Enable debugging of 
any interrupts of task_isolation cores. */ > +static int task_isolation_debug_flag; > +static int __init task_isolation_debug_func(char *str) > +{ > + task_isolation_debug_flag = true; > + return 1; > +} > +__setup("task_isolation_debug", task_isolation_debug_func); > + > +void task_isolation_debug(int cpu) > +{ > + if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) && > + (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) { > + pr_err("Interrupt detected for task_isolation cpu %d\n", cpu); > + dump_stack(); > + } > +} > +#endif > + > void sched_avg_update(struct rq *rq) > { > s64 period = sched_avg_period(); > diff --git a/kernel/signal.c b/kernel/signal.c > index 0f6bbbe77b46..c6e09f0f7e24 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) > */ > void signal_wake_up_state(struct task_struct *t, unsigned int state) > { > +#ifdef CONFIG_TASK_ISOLATION > + /* If the task is being killed, don't complain about task_isolation. */ > + if (state & TASK_WAKEKILL) > + t->task_isolation_flags = 0; > +#endif > set_tsk_thread_flag(t, TIF_SIGPENDING); > /* > * TASK_WAKEKILL also means wake it up in the stopped/traced/killable > diff --git a/kernel/smp.c b/kernel/smp.c > index 07854477c164..b0bddff2693d 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -14,6 +14,7 @@ > #include <linux/smp.h> > #include <linux/cpu.h> > #include <linux/sched.h> > +#include <linux/isolation.h> > > #include "smpboot.h" > > @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, > * locking and barrier primitives. Generic code isn't really > * equipped to do the right thing... 
> */ > + task_isolation_debug(cpu); > if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) > arch_send_call_function_single_ipi(cpu); > > @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, > } > > /* Send a message to all CPUs in the map */ > + for_each_cpu(cpu, cfd->cpumask) > + task_isolation_debug(cpu); > arch_send_call_function_ipi_mask(cfd->cpumask); > > if (wait) { > diff --git a/kernel/softirq.c b/kernel/softirq.c > index 479e4436f787..ed762fec7265 100644 > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -24,8 +24,10 @@ > #include <linux/ftrace.h> > #include <linux/smp.h> > #include <linux/smpboot.h> > +#include <linux/context_tracking.h> > #include <linux/tick.h> > #include <linux/irq.h> > +#include <linux/isolation.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/irq.h> > @@ -335,6 +337,11 @@ void irq_enter(void) > _local_bh_enable(); > } > > + if (context_tracking_cpu_is_enabled() && > + context_tracking_in_user() && > + !in_interrupt()) > + task_isolation_debug(smp_processor_id()); > + > __irq_enter(); > } > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-10-05 17:07 ` Luiz Capitulino @ 2015-10-08 0:33 ` Chris Metcalf 2015-10-08 20:28 ` Luiz Capitulino 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-10-08 0:33 UTC (permalink / raw) To: Luiz Capitulino Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On 10/5/2015 1:07 PM, Luiz Capitulino wrote: > On Mon, 28 Sep 2015 11:17:20 -0400 > Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> The new "task_isolation_debug" flag simplifies debugging >> of TASK_ISOLATION kernels when processes are running in >> PR_TASK_ISOLATION_ENABLE mode. Such processes should get no >> interrupts from the kernel, and if they do, when this boot flag is >> specified a kernel stack dump on the console is generated. >> >> It's possible to use ftrace to simply detect whether a task_isolation >> core has unexpectedly entered the kernel. But what this boot flag >> does is allow the kernel to provide better diagnostics, e.g. by >> reporting in the IPI-generating code what remote core and context >> is preparing to deliver an interrupt to a task_isolation core. >> >> It may be worth considering other ways to generate useful debugging >> output rather than console spew, but for now that is simple and direct. > Honest question: does any of the task_isolation_debug() calls added > by this patch take care of the case where vmstat_shepherd() may > schedule vmstat_update() to run because a TASK_ISOLATION process is > changing memory stats? The task_isolation_debug() calls don't "take care of" any cases - they are really just there to generate console dumps when the kernel unexpectedly interrupts a task_isolated task. 
The idea with vmstat is that before a task_isolated task returns to userspace, it quiesces the vmstat thread (does a final sweep to collect the stats and turns off the scheduled work item). As a result, the vmstat shepherd won't run while the task is in userspace. When and if it returns to the kernel, it will again sweep up the stats before returning to userspace. The usual shepherd mechanism on a housekeeping core might notice that the task had entered the kernel and started changing stats, and might then asynchronously restart the scheduled work, but it should be quiesced again regardless on the way back out to userspace. > If that's not taken care of yet, should we? I just don't know if we > should call task_isolation_exception() or task_isolation_debug(). task_isolation_exception() is called when an exception (page fault or similar) is generated synchronously by the running task and we want to make sure to notify the task with a signal if it has set up STRICT mode to indicate that it is not planning to enter the kernel. > In the case of the latter, wouldn't it be interesting to add it to > __queue_work() then? Well, queuing remote work involves sending an IPI, and we already tag both the SMP send side AND the client side IRQ side with a task_isolation_debug(), so I expect in practice it would be detected. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
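The quiesce protocol Chris describes can be sketched as a toy state machine: the exit-to-userspace path sweeps the stats and cancels the scheduled work, while the shepherd on a housekeeping core may re-arm it whenever the task is in the kernel touching stats. The struct and function names below are illustrative stand-ins, not the actual vmstat symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: `stats_dirty` stands in for pending per-cpu vmstat
 * deltas, `work_armed` for the scheduled vmstat_update() work item. */
struct cpu_state {
    bool stats_dirty;
    bool work_armed;
};

/* What the shepherd on a housekeeping core might do: re-arm the work
 * item if it sees the isolated core accumulating stat deltas. */
static void shepherd_notice(struct cpu_state *c)
{
    if (c->stats_dirty)
        c->work_armed = true;
}

/* What the task_isolation return-to-userspace path does: a final
 * sweep of the stats and cancellation of the scheduled work, so
 * nothing fires while the task runs in userspace. */
static void quiesce_vmstat(struct cpu_state *c)
{
    c->stats_dirty = false;   /* flush the counters */
    c->work_armed = false;    /* cancel the delayed work */
}
```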
* Re: [PATCH v7 05/11] task_isolation: add debug boot flag 2015-10-08 0:33 ` Chris Metcalf @ 2015-10-08 20:28 ` Luiz Capitulino 0 siblings, 0 replies; 340+ messages in thread From: Luiz Capitulino @ 2015-10-08 20:28 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Wed, 7 Oct 2015 20:33:56 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/5/2015 1:07 PM, Luiz Capitulino wrote: > > On Mon, 28 Sep 2015 11:17:20 -0400 > > Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > >> The new "task_isolation_debug" flag simplifies debugging > >> of TASK_ISOLATION kernels when processes are running in > >> PR_TASK_ISOLATION_ENABLE mode. Such processes should get no > >> interrupts from the kernel, and if they do, when this boot flag is > >> specified a kernel stack dump on the console is generated. > >> > >> It's possible to use ftrace to simply detect whether a task_isolation > >> core has unexpectedly entered the kernel. But what this boot flag > >> does is allow the kernel to provide better diagnostics, e.g. by > >> reporting in the IPI-generating code what remote core and context > >> is preparing to deliver an interrupt to a task_isolation core. > >> > >> It may be worth considering other ways to generate useful debugging > >> output rather than console spew, but for now that is simple and direct. > > Honest question: does any of the task_isolation_debug() calls added > > by this patch take care of the case where vmstat_shepherd() may > > schedule vmstat_update() to run because a TASK_ISOLATION process is > > changing memory stats? 
> > The task_isolation_debug() calls don't "take care of" any cases - they are > really just there to generate console dumps when the kernel unexpectedly > interrupts a task_isolated task. > > The idea with vmstat is that before a task_isolated task returns to > userspace, it quiesces the vmstat thread (does a final sweep to collect > the stats and turns off the scheduled work item). As a result, the vmstat > shepherd won't run while the task is in userspace. When and if it returns > to the kernel, it will again sweep up the stats before returning to userspace. > > The usual shepherd mechanism on a housekeeping core might notice > that the task had entered the kernel and started changing stats, and > might then asynchronously restart the scheduled work, but it should be > quiesced again regardless on the way back out to userspace. OK, I've missed the (obvious) fact that the process has to enter the kernel to change stats. Thanks a lot for your explanation. > > If that's not taken care of yet, should we? I just don't know if we > > should call task_isolation_exception() or task_isolation_debug(). > > task_isolation_exception() is called when an exception (page fault or > similar) is generated synchronously by the running task and we want > to make sure to notify the task with a signal if it has set up STRICT mode > to indicate that it is not planning to enter the kernel. > > > In the case of the latter, wouldn't it be interesting to add it to > > __queue_work() then? > > Well, queuing remote work involves sending an IPI, and we already tag > both the SMP send side AND the client side IRQ side with a task_isolation_debug(), > so I expect in practice it would be detected. > ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled 2015-09-28 15:17 ` Chris Metcalf ` (5 preceding siblings ...) (?) @ 2015-09-28 15:17 ` Chris Metcalf 2015-09-28 20:40 ` Andy Lutomirski -1 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still helpful for maintaining completely correct kernel semantics, processes using prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on running completely tickless, so don't bound the time_delta for such processes. In addition, due to the way such processes quiesce by waiting for the timer tick to stop prior to returning to userspace, without this commit it won't be possible to use the task_isolation mode at all. Removing the 1-second cap was previously discussed (see link below) and Thomas Gleixner observed that vruntime, load balancing data, load accounting, and other things might be impacted. Frederic Weisbecker similarly observed that allowing the tick to be indefinitely deferred just meant that no one would ever fix the underlying bugs. However it's at least true that the mode proposed in this patch can only be enabled on a nohz_full core by a process requesting task_isolation mode, which may limit how important it is to maintain scheduler data correctly, for example. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, this will provide an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently, so it may also be worth it from that perspective. 
Finally, it's worth observing that the tile architecture has been using similar code for its Zero-Overhead Linux for many years (starting in 2008) and customers are very enthusiastic about the resulting bare-metal performance on cores that are available to run full Linux semantics on demand (crash, logging, shutdown, etc). So these semantics are very useful if we can convince ourselves that doing this is safe. Link: https://lkml.kernel.org/r/alpine.DEB.2.11.1410311058500.32582@gentwo.org Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/time/tick-sched.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 3319e16f31e5..4504c0b95d0d 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -24,6 +24,7 @@ #include <linux/posix-timers.h> #include <linux/perf_event.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/irq_regs.h> @@ -634,7 +635,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, #ifdef CONFIG_NO_HZ_FULL /* Limit the tick delta to the maximum scheduler deferment */ - if (!ts->inidle) + if (!ts->inidle && !task_isolation_enabled()) delta = min(delta, scheduler_tick_max_deferment()); #endif -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
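The one-line change in this patch amounts to the following decision rule. This is a userspace restatement for illustration only: plain `long`s stand in for the kernel's ktime_t deltas, and `bound_tick_delta()` is a made-up name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Restatement of the clamp in tick_nohz_stop_sched_tick() after this
 * patch: the scheduler's maximum deferment bounds the tick delta only
 * when the CPU is neither idle nor running a task that has requested
 * task_isolation mode. */
static long bound_tick_delta(long delta, bool inidle, bool isolated,
                             long max_deferment)
{
    if (!inidle && !isolated && delta > max_deferment)
        delta = max_deferment;  /* fall back to the ~1Hz safety tick */
    return delta;               /* otherwise: defer indefinitely */
}
```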
* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled 2015-09-28 15:17 ` [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled Chris Metcalf @ 2015-09-28 20:40 ` Andy Lutomirski 2015-10-01 13:07 ` Frederic Weisbecker 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:40 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > While the current fallback to 1-second tick is still helpful for > maintaining completely correct kernel semantics, processes using > prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on > running completely tickless, so don't bound the time_delta for such > processes. In addition, due to the way such processes quiesce by > waiting for the timer tick to stop prior to returning to userspace, > without this commit it won't be possible to use the task_isolation > mode at all. > > Removing the 1-second cap was previously discussed (see link > below) and Thomas Gleixner observed that vruntime, load balancing > data, load accounting, and other things might be impacted. > Frederic Weisbecker similarly observed that allowing the tick to > be indefinitely deferred just meant that no one would ever fix the > underlying bugs. However it's at least true that the mode proposed > in this patch can only be enabled on a nohz_full core by a process > requesting task_isolation mode, which may limit how important it is > to maintain scheduler data correctly, for example. What goes wrong when a task enables this? Presumably either tasks that enable it experience problems or performance issues or it should always be enabled. 
One possible issue: __vdso_clock_gettime with any of the COARSE clocks as well as __vdso_time will break if the timekeeping code doesn't run somewhere with reasonable frequency on some core. Hopefully this always works. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
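The coarse-clock concern Andy raises can be observed from userspace: the COARSE clocks return a snapshot that the timekeeping code refreshes from the tick rather than reading the hardware counter, which is why a fully stopped tick is a worry for them. The small check below uses only ordinary Linux clock_gettime() calls (nothing from this patch series) to show the coarse reading trailing the fine-grained one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Read CLOCK_MONOTONIC_COARSE first, then CLOCK_MONOTONIC: the coarse
 * value is a snapshot taken at or before the moment of the read, so it
 * can only be less than or equal to the fine clock read afterward. */
static bool coarse_trails_fine(void)
{
    struct timespec coarse, fine;
    if (clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse) != 0)
        return true; /* coarse clock unsupported here; nothing to check */
    clock_gettime(CLOCK_MONOTONIC, &fine);
    return coarse.tv_sec < fine.tv_sec ||
           (coarse.tv_sec == fine.tv_sec && coarse.tv_nsec <= fine.tv_nsec);
}
```

If timekeeping stopped running entirely, that snapshot would simply go stale, and the lag would grow without bound rather than staying within a tick.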
* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled 2015-09-28 20:40 ` Andy Lutomirski @ 2015-10-01 13:07 ` Frederic Weisbecker 2015-10-01 14:13 ` Thomas Gleixner 0 siblings, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-01 13:07 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel On Mon, Sep 28, 2015 at 04:40:56PM -0400, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > > While the current fallback to 1-second tick is still helpful for > > maintaining completely correct kernel semantics, processes using > > prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on > > running completely tickless, so don't bound the time_delta for such > > processes. In addition, due to the way such processes quiesce by > > waiting for the timer tick to stop prior to returning to userspace, > > without this commit it won't be possible to use the task_isolation > > mode at all. > > > > Removing the 1-second cap was previously discussed (see link > > below) and Thomas Gleixner observed that vruntime, load balancing > > data, load accounting, and other things might be impacted. > > Frederic Weisbecker similarly observed that allowing the tick to > > be indefinitely deferred just meant that no one would ever fix the > > underlying bugs. However it's at least true that the mode proposed > > in this patch can only be enabled on a nohz_full core by a process > > requesting task_isolation mode, which may limit how important it is > > to maintain scheduler data correctly, for example. > > What goes wrong when a task enables this? Presumably either tasks > that enable it experience problems or performance issues or it should > always be enabled. 
We need to make the scheduler resilient to 0Hz tick. Currently it doesn't even correctly support 1Hz or any dynticks behaviour that isn't idle. See update_cpu_load_active() for example. > > One possible issue: __vdso_clock_gettime with any of the COARSE clocks > as well as __vdso_time will break if the timekeeping code doesn't run > somewhere with reasonable frequency on some core. Hopefully this > always works. > > --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 06/11] nohz: task_isolation: allow tick to be fully disabled 2015-10-01 13:07 ` Frederic Weisbecker @ 2015-10-01 14:13 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 14:13 UTC (permalink / raw) To: Frederic Weisbecker Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel On Thu, 1 Oct 2015, Frederic Weisbecker wrote: > On Mon, Sep 28, 2015 at 04:40:56PM -0400, Andy Lutomirski wrote: > > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > While the current fallback to 1-second tick is still helpful for > > > maintaining completely correct kernel semantics, processes using > > > prctl(PR_SET_TASK_ISOLATION) semantics place a higher priority on > > > running completely tickless, so don't bound the time_delta for such > > > processes. In addition, due to the way such processes quiesce by > > > waiting for the timer tick to stop prior to returning to userspace, > > > without this commit it won't be possible to use the task_isolation > > > mode at all. > > > > > > Removing the 1-second cap was previously discussed (see link > > > below) and Thomas Gleixner observed that vruntime, load balancing > > > data, load accounting, and other things might be impacted. > > > Frederic Weisbecker similarly observed that allowing the tick to > > > be indefinitely deferred just meant that no one would ever fix the > > > underlying bugs. However it's at least true that the mode proposed > > > in this patch can only be enabled on a nohz_full core by a process > > > requesting task_isolation mode, which may limit how important it is > > > to maintain scheduler data correctly, for example. > > > > What goes wrong when a task enables this? 
Presumably either tasks > > that enable it experience problems or performance issues or it should > > always be enabled. > > We need to make the scheduler resilient to 0Hz tick. Currently it doesn't > even correctly support 1Hz or any dynticks behaviour that isn't idle. Rik has started to work on this. No idea what the status of that is. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-28 15:17 ` Chris Metcalf ` (6 preceding siblings ...) (?) @ 2015-09-28 15:17 ` Chris Metcalf 2015-09-28 20:59 ` Andy Lutomirski -1 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel, H. Peter Anvin, x86 Cc: Chris Metcalf In prepare_exit_to_usermode(), we would like to call task_isolation_enter() on every return to userspace, and like other work items, we would like to recheck for more work after calling it, since it will enable interrupts internally. However, if task_isolation_enter() is the only work item, and it has already been called once, we don't want to continue calling it in a loop. We don't have a dedicated TIF flag for task isolation, and it wouldn't make sense to have one, since we'd want to set it before starting exit every time, and then clear it the first time around the loop. Instead, we change the loop structure somewhat, so that we have a more inclusive set of flags that are tested for on the first entry to the function (including TIF_NOHZ), and if any of those flags are set, we enter the loop. And, we do the task_isolation() test unconditionally at the bottom of the loop, but then when making the decision to loop back, we just use the set of flags that doesn't include TIF_NOHZ. That way we only loop if there is other work to do, but then if that work is done, we again unconditionally call task_isolation_enter(). 
In syscall_trace_enter_phase1(), we try to add the necessary support for strict-mode detection of syscalls in an optimized way, by letting the code remain unchanged if we are not using TASK_ISOLATION, but otherwise calling enter_from_user_mode() under the first time we see _TIF_NOHZ, and then waiting until after we do the secure computing work to actually clear the bit from the "work" variable and call task_isolation_syscall(). Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++----------- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 80dcc9261ca3..0f74389c6f3b 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -21,6 +21,7 @@ #include <linux/context_tracking.h> #include <linux/user-return-notifier.h> #include <linux/uprobes.h> +#include <linux/isolation.h> #include <asm/desc.h> #include <asm/traps.h> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) */ if (work & _TIF_NOHZ) { enter_from_user_mode(); - work &= ~_TIF_NOHZ; + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) + work &= ~_TIF_NOHZ; } #endif @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) } #endif + /* Now check task isolation, if needed. */ + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { + work &= ~_TIF_NOHZ; + if (task_isolation_strict()) + task_isolation_syscall(regs->orig_ax); + } + /* Do our best to finish without phase 2. */ if (work == 0) return ret; /* seccomp and/or nohz only (ret == 0 here) */ @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs) /* Called with IRQs disabled. 
*/ __visible void prepare_exit_to_usermode(struct pt_regs *regs) { + u32 cached_flags; + if (WARN_ON(!irqs_disabled())) local_irq_disable(); /* + * We may want to enter the loop here unconditionally to make + * sure to do some work at least once. Test here for all + * possible conditions that might make us enter the loop, + * and return immediately if none of them are set. + */ + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); + if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | + _TIF_UPROBE | _TIF_NEED_RESCHED | + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) { + user_enter(); + return; + } + + /* * In order to return to user mode, we need to have IRQs off with * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY, * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) * so we need to loop. Disabling preemption wouldn't help: doing the * work to clear some of the flags can sleep. */ - while (true) { - u32 cached_flags = - READ_ONCE(pt_regs_to_thread_info(regs)->flags); - - if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | - _TIF_UPROBE | _TIF_NEED_RESCHED | - _TIF_USER_RETURN_NOTIFY))) - break; - + do { /* We have work to do. */ local_irq_enable(); @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) if (cached_flags & _TIF_USER_RETURN_NOTIFY) fire_user_return_notifiers(); + if (task_isolation_enabled()) + task_isolation_enter(); + /* Disable IRQs and retry */ local_irq_disable(); - } + + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); + + } while (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | + _TIF_UPROBE | _TIF_NEED_RESCHED | + _TIF_USER_RETURN_NOTIFY))); user_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-28 15:17 ` [PATCH v7 07/11] arch/x86: enable task isolation functionality Chris Metcalf @ 2015-09-28 20:59 ` Andy Lutomirski 2015-09-28 21:57 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 20:59 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > In prepare_exit_to_usermode(), we would like to call > task_isolation_enter() on every return to userspace, and like > other work items, we would like to recheck for more work after > calling it, since it will enable interrupts internally. > > However, if task_isolation_enter() is the only work item, > and it has already been called once, we don't want to continue > calling it in a loop. We don't have a dedicated TIF flag for > task isolation, and it wouldn't make sense to have one, since > we'd want to set it before starting exit every time, and then > clear it the first time around the loop. > > Instead, we change the loop structure somewhat, so that we > have a more inclusive set of flags that are tested for on the > first entry to the function (including TIF_NOHZ), and if any > of those flags are set, we enter the loop. And, we do the > task_isolation() test unconditionally at the bottom of the loop, > but then when making the decision to loop back, we just use the > set of flags that doesn't include TIF_NOHZ. That way we only > loop if there is other work to do, but then if that work > is done, we again unconditionally call task_isolation_enter(). 
> > In syscall_trace_enter_phase1(), we try to add the necessary > support for strict-mode detection of syscalls in an optimized > way, by letting the code remain unchanged if we are not using > TASK_ISOLATION, but otherwise calling enter_from_user_mode() > under the first time we see _TIF_NOHZ, and then waiting until > after we do the secure computing work to actually clear the bit > from the "work" variable and call task_isolation_syscall(). > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > --- > arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 36 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > index 80dcc9261ca3..0f74389c6f3b 100644 > --- a/arch/x86/entry/common.c > +++ b/arch/x86/entry/common.c > @@ -21,6 +21,7 @@ > #include <linux/context_tracking.h> > #include <linux/user-return-notifier.h> > #include <linux/uprobes.h> > +#include <linux/isolation.h> > > #include <asm/desc.h> > #include <asm/traps.h> > @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) > */ > if (work & _TIF_NOHZ) { > enter_from_user_mode(); > - work &= ~_TIF_NOHZ; > + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) > + work &= ~_TIF_NOHZ; > } > #endif > > @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) > } > #endif > > + /* Now check task isolation, if needed. */ > + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { > + work &= ~_TIF_NOHZ; > + if (task_isolation_strict()) > + task_isolation_syscall(regs->orig_ax); > + } > + This is IMO rather nasty. Can you try to find a way to do this without making the control flow depend on config options? What guarantees that TIF_NOHZ is an acceptable thing to check? > /* Do our best to finish without phase 2. 
*/ > if (work == 0) > return ret; /* seccomp and/or nohz only (ret == 0 here) */ > @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs) > /* Called with IRQs disabled. */ > __visible void prepare_exit_to_usermode(struct pt_regs *regs) > { > + u32 cached_flags; > + > if (WARN_ON(!irqs_disabled())) > local_irq_disable(); > > /* > + * We may want to enter the loop here unconditionally to make > + * sure to do some work at least once. Test here for all > + * possible conditions that might make us enter the loop, > + * and return immediately if none of them are set. > + */ > + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); > + if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | > + _TIF_UPROBE | _TIF_NEED_RESCHED | > + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) { > + user_enter(); > + return; > + } > + Too complicated and too error prone. In any event, I don't think that the property you actually want is for the loop to be entered once. I think the property you want is that we're isolated by the time we're finished. Why not just check that directly in the loop condition? > + /* > * In order to return to user mode, we need to have IRQs off with > * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY, > * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags > @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) > * so we need to loop. Disabling preemption wouldn't help: doing the > * work to clear some of the flags can sleep. > */ > - while (true) { > - u32 cached_flags = > - READ_ONCE(pt_regs_to_thread_info(regs)->flags); > - > - if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | > - _TIF_UPROBE | _TIF_NEED_RESCHED | > - _TIF_USER_RETURN_NOTIFY))) > - break; > - > + do { > /* We have work to do. 
*/ > local_irq_enable(); > > @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) > if (cached_flags & _TIF_USER_RETURN_NOTIFY) > fire_user_return_notifiers(); > > + if (task_isolation_enabled()) > + task_isolation_enter(); > + Does anything here guarantee forward progress or at least give reasonable confidence that we'll make forward progress? > /* Disable IRQs and retry */ > local_irq_disable(); > - } > + > + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); > + > + } while (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | > + _TIF_UPROBE | _TIF_NEED_RESCHED | > + _TIF_USER_RETURN_NOTIFY))); > > user_enter(); > } > -- > 2.1.2 > --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-28 20:59 ` Andy Lutomirski @ 2015-09-28 21:57 ` Chris Metcalf 2015-09-28 22:43 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 21:57 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On 09/28/2015 04:59 PM, Andy Lutomirski wrote: > On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> In prepare_exit_to_usermode(), we would like to call >> task_isolation_enter() on every return to userspace, and like >> other work items, we would like to recheck for more work after >> calling it, since it will enable interrupts internally. >> >> However, if task_isolation_enter() is the only work item, >> and it has already been called once, we don't want to continue >> calling it in a loop. We don't have a dedicated TIF flag for >> task isolation, and it wouldn't make sense to have one, since >> we'd want to set it before starting exit every time, and then >> clear it the first time around the loop. >> >> Instead, we change the loop structure somewhat, so that we >> have a more inclusive set of flags that are tested for on the >> first entry to the function (including TIF_NOHZ), and if any >> of those flags are set, we enter the loop. And, we do the >> task_isolation() test unconditionally at the bottom of the loop, >> but then when making the decision to loop back, we just use the >> set of flags that doesn't include TIF_NOHZ. That way we only >> loop if there is other work to do, but then if that work >> is done, we again unconditionally call task_isolation_enter(). 
>> >> In syscall_trace_enter_phase1(), we try to add the necessary >> support for strict-mode detection of syscalls in an optimized >> way, by letting the code remain unchanged if we are not using >> TASK_ISOLATION, but otherwise calling enter_from_user_mode() >> under the first time we see _TIF_NOHZ, and then waiting until >> after we do the secure computing work to actually clear the bit >> from the "work" variable and call task_isolation_syscall(). >> >> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> >> --- >> arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++----------- >> 1 file changed, 36 insertions(+), 11 deletions(-) >> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >> index 80dcc9261ca3..0f74389c6f3b 100644 >> --- a/arch/x86/entry/common.c >> +++ b/arch/x86/entry/common.c >> @@ -21,6 +21,7 @@ >> #include <linux/context_tracking.h> >> #include <linux/user-return-notifier.h> >> #include <linux/uprobes.h> >> +#include <linux/isolation.h> >> >> #include <asm/desc.h> >> #include <asm/traps.h> >> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) >> */ >> if (work & _TIF_NOHZ) { >> enter_from_user_mode(); >> - work &= ~_TIF_NOHZ; >> + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) >> + work &= ~_TIF_NOHZ; >> } >> #endif >> >> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) >> } >> #endif >> >> + /* Now check task isolation, if needed. */ >> + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { >> + work &= ~_TIF_NOHZ; >> + if (task_isolation_strict()) >> + task_isolation_syscall(regs->orig_ax); >> + } >> + > This is IMO rather nasty. Can you try to find a way to do this > without making the control flow depend on config options? Well, I suppose this is the best argument for testing for task isolation before seccomp :-) Honestly, if not, it's tricky to see how to do better; I did spend some time looking at it. 
One possibility is to just unconditionally clear _TIF_NOHZ before testing "work == 0", so that we can test (work & TIF_NOHZ) once early and once after seccomp. This presumably costs a cycle in the no-nohz-full case. So maybe just do it before seccomp... > What guarantees that TIF_NOHZ is an acceptable thing to check? Well, TIF_NOHZ is set on all tasks whenever we are running with nohz_full enabled anywhere, so testing it lets us do stuff on the fastpath without slowing down the fastpath much. See context_tracking_cpu_set(). >> /* Do our best to finish without phase 2. */ >> if (work == 0) >> return ret; /* seccomp and/or nohz only (ret == 0 here) */ >> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs) >> /* Called with IRQs disabled. */ >> __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> { >> + u32 cached_flags; >> + >> if (WARN_ON(!irqs_disabled())) >> local_irq_disable(); >> >> /* >> + * We may want to enter the loop here unconditionally to make >> + * sure to do some work at least once. Test here for all >> + * possible conditions that might make us enter the loop, >> + * and return immediately if none of them are set. >> + */ >> + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); >> + if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | >> + _TIF_UPROBE | _TIF_NEED_RESCHED | >> + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) { >> + user_enter(); >> + return; >> + } >> + > Too complicated and too error prone. > > In any event, I don't think that the property you actually want is for > the loop to be entered once. I think the property you want is that > we're isolated by the time we're finished. Why not just check that > directly in the loop condition? So something like this (roughly): if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)) && + task_isolation_done()) break; i.e. just add the one extra call? 
That could work, I suppose. In the body we would then keep the proposed logic that unconditionally calls task_isolation_enter(). >> + /* >> * In order to return to user mode, we need to have IRQs off with >> * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY, >> * _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags >> @@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> * so we need to loop. Disabling preemption wouldn't help: doing the >> * work to clear some of the flags can sleep. >> */ >> - while (true) { >> - u32 cached_flags = >> - READ_ONCE(pt_regs_to_thread_info(regs)->flags); >> - >> - if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | >> - _TIF_UPROBE | _TIF_NEED_RESCHED | >> - _TIF_USER_RETURN_NOTIFY))) >> - break; >> - >> + do { >> /* We have work to do. */ >> local_irq_enable(); >> >> @@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) >> if (cached_flags & _TIF_USER_RETURN_NOTIFY) >> fire_user_return_notifiers(); >> >> + if (task_isolation_enabled()) >> + task_isolation_enter(); >> + > Does anything here guarantee forward progress or at least give > reasonable confidence that we'll make forward progress? A given task can get stuck in the kernel if it has a lengthy far-future alarm() type situation, or if there are multiple task-isolated tasks scheduled onto the same core, but that only affects those tasks; other tasks on the same core, and the system as a whole, are OK. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-28 21:57 ` Chris Metcalf @ 2015-09-28 22:43 ` Andy Lutomirski 2015-09-29 17:42 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-28 22:43 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Mon, Sep 28, 2015 at 2:57 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/28/2015 04:59 PM, Andy Lutomirski wrote: >> >> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> In prepare_exit_to_usermode(), we would like to call >>> task_isolation_enter() on every return to userspace, and like >>> other work items, we would like to recheck for more work after >>> calling it, since it will enable interrupts internally. >>> >>> However, if task_isolation_enter() is the only work item, >>> and it has already been called once, we don't want to continue >>> calling it in a loop. We don't have a dedicated TIF flag for >>> task isolation, and it wouldn't make sense to have one, since >>> we'd want to set it before starting exit every time, and then >>> clear it the first time around the loop. >>> >>> Instead, we change the loop structure somewhat, so that we >>> have a more inclusive set of flags that are tested for on the >>> first entry to the function (including TIF_NOHZ), and if any >>> of those flags are set, we enter the loop. And, we do the >>> task_isolation() test unconditionally at the bottom of the loop, >>> but then when making the decision to loop back, we just use the >>> set of flags that doesn't include TIF_NOHZ. That way we only >>> loop if there is other work to do, but then if that work >>> is done, we again unconditionally call task_isolation_enter(). 
>>> >>> In syscall_trace_enter_phase1(), we try to add the necessary >>> support for strict-mode detection of syscalls in an optimized >>> way, by letting the code remain unchanged if we are not using >>> TASK_ISOLATION, but otherwise calling enter_from_user_mode() >>> under the first time we see _TIF_NOHZ, and then waiting until >>> after we do the secure computing work to actually clear the bit >>> from the "work" variable and call task_isolation_syscall(). >>> >>> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> >>> --- >>> arch/x86/entry/common.c | 47 >>> ++++++++++++++++++++++++++++++++++++----------- >>> 1 file changed, 36 insertions(+), 11 deletions(-) >>> >>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >>> index 80dcc9261ca3..0f74389c6f3b 100644 >>> --- a/arch/x86/entry/common.c >>> +++ b/arch/x86/entry/common.c >>> @@ -21,6 +21,7 @@ >>> #include <linux/context_tracking.h> >>> #include <linux/user-return-notifier.h> >>> #include <linux/uprobes.h> >>> +#include <linux/isolation.h> >>> >>> #include <asm/desc.h> >>> #include <asm/traps.h> >>> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs >>> *regs, u32 arch) >>> */ >>> if (work & _TIF_NOHZ) { >>> enter_from_user_mode(); >>> - work &= ~_TIF_NOHZ; >>> + if (!IS_ENABLED(CONFIG_TASK_ISOLATION)) >>> + work &= ~_TIF_NOHZ; >>> } >>> #endif >>> >>> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct >>> pt_regs *regs, u32 arch) >>> } >>> #endif >>> >>> + /* Now check task isolation, if needed. */ >>> + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) { >>> + work &= ~_TIF_NOHZ; >>> + if (task_isolation_strict()) >>> + task_isolation_syscall(regs->orig_ax); >>> + } >>> + >> >> This is IMO rather nasty. Can you try to find a way to do this >> without making the control flow depend on config options? 
> > > Well, I suppose this is the best argument for testing for task > isolation before seccomp :-) > > Honestly, if not, it's tricky to see how to do better; I did spend > some time looking at it. One possibility is to just unconditionally > clear _TIF_NOHZ before testing "work == 0", so that we can > test (work & TIF_NOHZ) once early and once after seccomp. > This presumably costs a cycle in the no-nohz-full case. > > So maybe just do it before seccomp... > >> What guarantees that TIF_NOHZ is an acceptable thing to check? > > > Well, TIF_NOHZ is set on all tasks whenever we are running with > nohz_full enabled anywhere, so testing it lets us do stuff on > the fastpath without slowing down the fastpath much. > See context_tracking_cpu_set(). > > >>> /* Do our best to finish without phase 2. */ >>> if (work == 0) >>> return ret; /* seccomp and/or nohz only (ret == 0 here) >>> */ >>> @@ -217,10 +226,26 @@ static struct thread_info >>> *pt_regs_to_thread_info(struct pt_regs *regs) >>> /* Called with IRQs disabled. */ >>> __visible void prepare_exit_to_usermode(struct pt_regs *regs) >>> { >>> + u32 cached_flags; >>> + >>> if (WARN_ON(!irqs_disabled())) >>> local_irq_disable(); >>> >>> /* >>> + * We may want to enter the loop here unconditionally to make >>> + * sure to do some work at least once. Test here for all >>> + * possible conditions that might make us enter the loop, >>> + * and return immediately if none of them are set. >>> + */ >>> + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); >>> + if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME | >>> + _TIF_UPROBE | _TIF_NEED_RESCHED | >>> + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) { >>> + user_enter(); >>> + return; >>> + } >>> + >> >> Too complicated and too error prone. >> >> In any event, I don't think that the property you actually want is for >> the loop to be entered once. I think the property you want is that >> we're isolated by the time we're finished. 
Why not just check that >> directly in the loop condition? > > > So something like this (roughly): > > if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | > _TIF_UPROBE | _TIF_NEED_RESCHED | > _TIF_USER_RETURN_NOTIFY)) && > + task_isolation_done()) > break; > > i.e. just add the one extra call? That could work, I suppose. > In the body we would then keep the proposed logic that unconditionally > calls task_isolation_enter(). Yeah, I think so. >> Does anything here guarantee forward progress or at least give >> reasonable confidence that we'll make forward progress? > > > A given task can get stuck in the kernel if it has a lengthy far-future > alarm() type situation, or if there are multiple task-isolated tasks > scheduled onto the same core, but that only affects those tasks; > other tasks on the same core, and the system as a whole, are OK. Why are we treating alarms as something that should defer entry to userspace? I think it would be entirely reasonable to set an alarm for ten minutes, ask for isolation, and then think hard for ten minutes. A bigger issue would be if there's an RT task that asks for isolation and a bunch of other stuff (most notably KVM hosts) running with unconstrained affinity at full load. If task_isolation_enter always sleeps, then your KVM host will get scheduled, and it'll ask for a user return notifier on the way out, and you might just loop forever. Can this happen? ISTM something's suboptimal with the inner workings of all this if task_isolation_enter needs to sleep to wait for an event that isn't scheduled for the immediate future (e.g. already queued up as an interrupt). --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-28 22:43 ` Andy Lutomirski @ 2015-09-29 17:42 ` Chris Metcalf 2015-09-29 17:57 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-29 17:42 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On 09/28/2015 06:43 PM, Andy Lutomirski wrote: > Why are we treating alarms as something that should defer entry to > userspace? I think it would be entirely reasonable to set an alarm > for ten minutes, ask for isolation, and then think hard for ten > minutes. > > A bigger issue would be if there's an RT task that asks for isolation > and a bunch of other stuff (most notably KVM hosts) running with > uncontrained affinity at full load. If task_isolation_enter always > sleeps, then your KVM host will get scheduled, and it'll ask for a > user return notifier on the way out, and you might just loop forever. > Can this happen? task_isolation_enter() doesn't sleep - it spins. This is intentional, because the point is that there should be nothing else that could be scheduled on that cpu. We're just waiting for any pending kernel management timer interrupts to fire. In any case, you normally wouldn't have a KVM host running on an isolcpus, nohz_full cpu, unless it was the only thing running there, I imagine (just as would be true for any other host process). > ISTM something's suboptimal with the inner workings of all this if > task_isolation_enter needs to sleep to wait for an event that isn't > scheduled for the immediate future (e.g. already queued up as an > interrupt). 
Scheduling a timer for 10 minutes away is typically done by scheduling timers for the max timer granularity (which could be just a few seconds) and then waking up a couple of hundred times between now and now+10 minutes. Doing this breaks the task isolation guarantee, so we can't return to userspace while something like that is pending. You'd have to do it by polling in userspace to avoid the unexpected interrupts. I suppose if your hardware supported it, you could imagine a mode where userspace can request an alarm a specific amount of time in the future, and the task isolation code would then ignore an alarm that was going off at that specific time. But I'm not sure what hardware does support that (I know tile uses the "few seconds and re-arm" model), and it seems like a pretty corner use-case. We could certainly investigate adding such support later, but I don't see it as part of the core value proposition for task isolation. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-29 17:42 ` Chris Metcalf @ 2015-09-29 17:57 ` Andy Lutomirski 2015-09-30 20:25 ` Thomas Gleixner 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-29 17:57 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 09/28/2015 06:43 PM, Andy Lutomirski wrote: >> >> Why are we treating alarms as something that should defer entry to >> userspace? I think it would be entirely reasonable to set an alarm >> for ten minutes, ask for isolation, and then think hard for ten >> minutes. >> >> A bigger issue would be if there's an RT task that asks for isolation >> and a bunch of other stuff (most notably KVM hosts) running with >> uncontrained affinity at full load. If task_isolation_enter always >> sleeps, then your KVM host will get scheduled, and it'll ask for a >> user return notifier on the way out, and you might just loop forever. >> Can this happen? > > > task_isolation_enter() doesn't sleep - it spins. This is intentional, > because the point is that there should be nothing else that > could be scheduled on that cpu. We're just waiting for any > pending kernel management timer interrupts to fire. > > In any case, you normally wouldn't have a KVM host running > on an isolcpus, nohz_full cpu, unless it was the only thing > running there, I imagine (just as would be true for any other > host process). The problem is that AFAICT systemd (and possibly other things) makes is rather painful to guarantee that nothing low-priority (systemd itself) would schedule on an arbitrary CPU. 
I would hope that merely setting affinity and RT would be enough to get isolation without playing fancy cgroup games. Maybe not. > >> ISTM something's suboptimal with the inner workings of all this if >> task_isolation_enter needs to sleep to wait for an event that isn't >> scheduled for the immediate future (e.g. already queued up as an >> interrupt). > > > Scheduling a timer for 10 minutes away is typically done by > scheduling timers for the max timer granularity (which could > be just a few seconds) and then waking up a couple of hundred > times between now and now+10 minutes. Doing this breaks > the task isolation guarantee, so we can't return to userspace > while something like that is pending. You'd have to do it > by polling in userspace to avoid the unexpected interrupts. > Really? That sucks. Hopefully we can fix it. > I suppose if your hardware supported it, you could imagine > a mode where userspace can request an alarm a specific > amount of time in the future, and the task isolation code > would then ignore an alarm that was going off at that > specific time. But I'm not sure what hardware does support > that (I know tile uses the "few seconds and re-arm" model), > and it seems like a pretty corner use-case. We could > certainly investigate adding such support later, but I don't see > it as part of the core value proposition for task isolation. > Intel chips Sandy Bridge and newer certainly support this. Other chips might support it as well. Whether the kernel is able to program the TSC deadline timer like that is a different question. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-29 17:57 ` Andy Lutomirski @ 2015-09-30 20:25 ` Thomas Gleixner 2015-09-30 20:58 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Thomas Gleixner @ 2015-09-30 20:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Tue, 29 Sep 2015, Andy Lutomirski wrote: > On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > > Scheduling a timer for 10 minutes away is typically done by > > scheduling timers for the max timer granularity (which could > > be just a few seconds) and then waking up a couple of hundred > > times between now and now+10 minutes. Doing this breaks > > the task isolation guarantee, so we can't return to userspace > > while something like that is pending. You'd have to do it > > by polling in userspace to avoid the unexpected interrupts. > > > > Really? That sucks. Hopefully we can fix it. Well. If there is a timer pending, then what do you want to fix? If the hardware does not allow programming a far-out expiry time, then the kernel cannot do anything about it. That depends on the timer hardware, really. PIT can do ~50ms, HPET ~3min, APIC timer ~32sec, TSC deadline timer ~500years. > > I suppose if your hardware supported it, you could imagine > > a mode where userspace can request an alarm a specific > > amount of time in the future, and the task isolation code > > would then ignore an alarm that was going off at that > > specific time. Ignore in what way? > > But I'm not sure what hardware does support > > that (I know tile uses the "few seconds and re-arm" model), > > and it seems like a pretty corner use-case.
We could > > certainly investigate adding such support later, but I don't see > > it as part of the core value proposition for task isolation. I'm really not following here. If user space requested that timer, WHY on earth do you want to spin in the kernel until it fired? That's insane. > Intel chips Sandy Bridge and newer certainly support this. Other chips > might support it as well. Whether the kernel is able to program the > TSC deadline timer like that is a different question. If the next expiry is out 100 years then we will happily program it for that expiry time. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-30 20:25 ` Thomas Gleixner @ 2015-09-30 20:58 ` Chris Metcalf 2015-09-30 22:02 ` Thomas Gleixner 0 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-09-30 20:58 UTC (permalink / raw) To: Thomas Gleixner, Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On 09/30/2015 04:25 PM, Thomas Gleixner wrote: > On Tue, 29 Sep 2015, Andy Lutomirski wrote: >> On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >>> I suppose if your hardware supported it, you could imagine >>> a mode where userspace can request an alarm a specific >>> amount of time in the future, and the task isolation code >>> would then ignore an alarm that was going off at that >>> specific time. > Ignore in what way? So the model for task isolation, as proposed, is that you are NOT trying to use kernel services, because the overheads are too great. The typical example is that you are running a userspace packet-processing application, and entering the kernel for pretty much any reason will cause your thread to drop packets all over the floor. So, we ensure that you never enter the kernel asynchronously. Now obviously if the task needs to enter the kernel, you have to allow it do so, and in fact there will be plenty of scenarios where that's exactly what you want to do; but for example you may have a way to register that a particular thread is opting out of packet processing for the moment, and it will instead go off and use the kernel to log some kind of information about some exceptional packet flow, or some debugging state about its own internals, or whatever. At that moment you would like the thread to be able to use arbitrary kernel facilities in a relatively transparent way. 
However, when it is done using the kernel and returns to userspace and signs up to handle packets again, you REALLY don't want to encounter some kind of tailing off of timer interrupts while some kernel subsystem quiesces or whatever. So we would like to spin in the kernel until no further timer interrupts are queued. In the original tile implementation, we would just wait until the timer interrupt was masked, which guaranteed it wouldn't fire again; for a more portable approach in the task-isolation patch series, I'm testing that the tick_cpu_device has next_event set to KTIME_MAX. The discussion in this email thread is that maybe it might make sense for one of these userspace driver threads to request a SIGALRM in 10 minutes, and then you'd assume it was done very intentionally, and figure out a way to discount that timer event somehow -- so you'd still return to userspace if all that was pending was one timer interrupt scheduled 10 minutes out, but if your hardware didn't allow setting such a long timer, you wouldn't return early, or if some other event was scheduled sooner, you wouldn't return early, etc. As I said in my previous email, this seems like a total corner case and not worth worrying about now; maybe in the future someone will want to think about it. So for now, if a task-isolation thread sets up a timer, they're screwed: so, don't do that. And it's really not part of the typical programming model for these kinds of userspace drivers anyway, so it's pretty reasonable to forbid it. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
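The "spin until no timer is queued" check described above -- wait until the per-cpu tick device's next_event reads KTIME_MAX -- can be illustrated with a small userspace model. This is only a sketch of the idea, not kernel code; the struct and function names here are invented for illustration, and the real tick_cpu_device bookkeeping is considerably more involved:

```c
#include <stdint.h>
#include <stddef.h>

#define KTIME_MAX INT64_MAX  /* sentinel for "no event scheduled", as in the kernel */

/* Soonest pending expiry, or KTIME_MAX when nothing is queued --
 * the condition the patch series tests on the tick device. */
int64_t model_next_event(const int64_t *pending, size_t n)
{
    int64_t next = KTIME_MAX;
    for (size_t i = 0; i < n; i++)
        if (pending[i] < next)
            next = pending[i];
    return next;
}

/* Model of the return-to-user quiesce loop: spin, letting each queued
 * timer "fire" (here: marking its slot consumed), until no further
 * event remains.  Returns how many events had to fire first. */
int model_quiesce(int64_t *pending, size_t n)
{
    int fired = 0;
    while (model_next_event(pending, n) != KTIME_MAX) {
        size_t soonest = 0;  /* consume the soonest event, as the hardware would */
        for (size_t i = 1; i < n; i++)
            if (pending[i] < pending[soonest])
                soonest = i;
        pending[soonest] = KTIME_MAX;
        fired++;
    }
    return fired;
}
```

The unbounded while loop is exactly what draws Thomas Gleixner's objection below: if an event is parked far in the future, this model spins until it fires.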
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-30 20:58 ` Chris Metcalf @ 2015-09-30 22:02 ` Thomas Gleixner 2015-09-30 22:11 ` Andy Lutomirski 0 siblings, 1 reply; 340+ messages in thread From: Thomas Gleixner @ 2015-09-30 22:02 UTC (permalink / raw) To: Chris Metcalf Cc: Andy Lutomirski, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Wed, 30 Sep 2015, Chris Metcalf wrote: > So for now, if a task-isolation thread sets up a timer, > they're screwed: so, don't do that. And it's really not part of > the typical programming model for these kinds of userspace > drivers anyway, so it's pretty reasonable to forbid it. There is a difference between forbidding it and looping for 10 minutes in the kernel. I have yet to understand WHY this loop is there at all. All I've seen so far is that things might need to settle and that the indicator for settlement is that the next expiry time of the per cpu timer is set to KTIME_MAX. To be blunt, that just stinks. This is duct tape engineering and not even close to a reasonable design. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-30 22:02 ` Thomas Gleixner @ 2015-09-30 22:11 ` Andy Lutomirski 2015-10-01 8:12 ` Thomas Gleixner 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-09-30 22:11 UTC (permalink / raw) To: Thomas Gleixner Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, 30 Sep 2015, Chris Metcalf wrote: >> So for now, if a task-isolation thread sets up a timer, >> they're screwed: so, don't do that. And it's really not part of >> the typical programming model for these kinds of userspace >> drivers anyway, so it's pretty reasonable to forbid it. > > There is a difference between forbidding it and looping for 10 minutes > in the kernel. I don't even like forbidding it. Setting timers seems like an entirely reasonable thing for even highly RT or isolated programs to do, although admittedly they can do it on a non-RT thread and then kick the RT thread when they're ready. Heck, even without the TSC deadline timer, the kernel could, in principle, support that use case by having whatever core is doing housekeeping keep kicking the can forward until it's time to IPI the isolated core because it needs to wake up. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-09-30 22:11 ` Andy Lutomirski @ 2015-10-01 8:12 ` Thomas Gleixner 2015-10-01 9:08 ` Christoph Lameter 2015-10-01 19:25 ` Chris Metcalf 0 siblings, 2 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 8:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Wed, 30 Sep 2015, Andy Lutomirski wrote: > On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Wed, 30 Sep 2015, Chris Metcalf wrote: > >> So for now, if a task-isolation thread sets up a timer, > >> they're screwed: so, don't do that. And it's really not part of > >> the typical programming model for these kinds of userspace > >> drivers anyway, so it's pretty reasonable to forbid it. > > > > There is a difference between forbidding it and looping for 10 minutes > > in the kernel. > > I don't even like forbidding it. Setting timers seems like an > entirely reasonable thing for even highly RT or isolated programs to > do, although admittedly they can do it on a non-RT thread and then > kick the RT thread when they're ready. > > Heck, even without the TSC deadline timer, the kernel could, in > principle, support that use case by having whatever core is doing > housekeeping keep kicking the can forward until it's time to IPI the > isolated core because it needs to wake up. That's simple. Just arm the timer on the other core. It's not rocket science to do that. But the whole problem with this isolation stuff is that it tries to push half-baked duct-tape concepts into the tree. That would be the same if we'd brute force merge the RT stuff and then let everyone deal with the fallout. 
There is a really good reason, why the remaining - hard to solve - pieces of RT are still out of tree. And I really want to see a proper engineering for that isolation stuff, which can be done with an out of tree patch set in the first place. But sure, it's more convenient to push crap into mainline and let everyone else deal with the fallouts. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-10-01 8:12 ` Thomas Gleixner @ 2015-10-01 9:08 ` Christoph Lameter 2015-10-01 10:10 ` Thomas Gleixner 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 1 reply; 340+ messages in thread From: Christoph Lameter @ 2015-10-01 9:08 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Thu, 1 Oct 2015, Thomas Gleixner wrote: > And I really want to see a proper engineering for that isolation > stuff, which can be done with an out of tree patch set in the first > place. But sure, it's more convenient to push crap into mainline and > let everyone else deal with the fallouts. Yes, let's keep the half baked stuff out please. Firing a timer that signals the app via a signal causes an event that is not unlike the OS noise that we are trying to avoid. It's an asynchronous event that may interrupt at random locations in the code. In that case I would say it's perfectly fine to use the tick and other timer processing on the processor that requested it. If you really want low latency and be OS intervention free then please do not set up timers. In fact any signal should bring on full OS services on a processor. AFAICT one would communicate via shared memory structures rather than IPIs within the threads of an app that requires low latency and the OS noise to be minimal. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-10-01 9:08 ` Christoph Lameter @ 2015-10-01 10:10 ` Thomas Gleixner 0 siblings, 0 replies; 340+ messages in thread From: Thomas Gleixner @ 2015-10-01 10:10 UTC (permalink / raw) To: Christoph Lameter Cc: Andy Lutomirski, Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On Thu, 1 Oct 2015, Christoph Lameter wrote: > On Thu, 1 Oct 2015, Thomas Gleixner wrote: > > > And I really want to see a proper engineering for that isolation > > stuff, which can be done with an out of tree patch set in the first > > place. But sure, it's more convenient to push crap into mainline and > > let everyone else deal with the fallouts. > > Yes, let's keep the half baked stuff out please. Firing a timer that signals > the app via a signal causes an event that is not unlike the OS noise that > we are trying to avoid. It's an asynchronous event that may interrupt at > random locations in the code. In that case I would say it's perfectly fine > to use the tick and other timer processing on the processor that requested > it. If you really want low latency and be OS intervention free then please > do not set up timers. In fact any signal should bring on full OS services > on a processor. Right. That's a recommendation to the application developer, which he can follow or not. And he has to deal with the consequences if not. Thanks, tglx ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality 2015-10-01 8:12 ` Thomas Gleixner 2015-10-01 9:08 ` Christoph Lameter @ 2015-10-01 19:25 ` Chris Metcalf 1 sibling, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-01 19:25 UTC (permalink / raw) To: Thomas Gleixner, Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-kernel, H. Peter Anvin, X86 ML On 10/01/2015 04:12 AM, Thomas Gleixner wrote: > On Wed, 30 Sep 2015, Andy Lutomirski wrote: >> On Wed, Sep 30, 2015 at 3:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >>> On Wed, 30 Sep 2015, Chris Metcalf wrote: >>>> So for now, if a task-isolation thread sets up a timer, >>>> they're screwed: so, don't do that. And it's really not part of >>>> the typical programming model for these kinds of userspace >>>> drivers anyway, so it's pretty reasonable to forbid it. >>> There is a difference between forbidding it and looping for 10 minutes >>> in the kernel. >> I don't even like forbidding it. Setting timers seems like an >> entirely reasonable thing for even highly RT or isolated programs to >> do, although admittedly they can do it on a non-RT thread and then >> kick the RT thread when they're ready. >> >> Heck, even without the TSC deadline timer, the kernel could, in >> principle, support that use case by having whatever core is doing >> housekeeping keep kicking the can forward until it's time to IPI the >> isolated core because it needs to wake up. > That's simple. Just arm the timer on the other core. It's not rocket > science to do that. This is a plausible direction to go for alarms requested when task isolation is enabled. But as Christoph said, it's almost certainly a bad idea anyway. 
Our customers are advised to do this kind of stuff (what we call "control-plane" activity) in a separate process on a housekeeping core, which communicates with the nohz_full cores via shared memory. On the nohz_full side the threads just use polling and simple atomics for locking. (That's fun too, because you can't actually allow those locks to get into a contended state since that obliges the unlocker to invoke futex_wake in the kernel, so we can't just use pthread mutexes or other common implementations.) -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
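The shared-memory style Chris describes -- polling plus simple atomics, so the isolated thread never has to make a syscall such as futex_wake -- is commonly realized as a single-producer/single-consumer ring buffer. A minimal sketch with C11 atomics follows; all names are illustrative and not taken from any patch in this series:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Single-producer/single-consumer ring shared between the isolated
 * (nohz_full) thread and the housekeeping thread.  Both sides
 * synchronize with plain acquire/release atomics, so neither ever
 * blocks or enters the kernel; on failure the caller simply polls. */
#define RING_SLOTS 8u  /* must be a power of two */

struct spsc_ring {
    _Atomic uint32_t head;          /* written only by the producer */
    _Atomic uint32_t tail;          /* written only by the consumer */
    uint64_t slot[RING_SLOTS];
};

bool ring_push(struct spsc_ring *r, uint64_t v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;               /* full: caller polls and retries */
    r->slot[head & (RING_SLOTS - 1)] = v;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

bool ring_pop(struct spsc_ring *r, uint64_t *v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;               /* empty: keep polling */
    *v = r->slot[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Because both push and pop return immediately instead of blocking, contention never forces either thread into the kernel, which is exactly why this pattern suits nohz_full cores better than pthread mutexes.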
* [PATCH v7 08/11] arch/arm64: adopt prepare_exit_to_usermode() model from x86 2015-09-28 15:17 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel, linux-arm-kernel Cc: Chris Metcalf This change is a prerequisite change for TASK_ISOLATION but also stands on its own for readability and maintainability. The existing arm64 do_notify_resume() is called in a loop from assembly on the slow path; this change moves the loop into C code as well. For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C"). Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/entry.S | 6 +++--- arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++---------- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index 4306c937b1ff..6fcbf8ea307b 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -628,9 +628,8 @@ work_pending: mov x0, sp // 'regs' tst x2, #PSR_MODE_MASK // user mode regs? 
b.ne no_work_pending // returning to kernel - enable_irq // enable interrupts for do_notify_resume() - bl do_notify_resume - b ret_to_user + bl prepare_exit_to_usermode + b no_user_work_pending work_resched: bl schedule @@ -642,6 +641,7 @@ ret_to_user: ldr x1, [tsk, #TI_FLAGS] and x2, x1, #_TIF_WORK_MASK cbnz x2, work_pending +no_user_work_pending: enable_step_tsk x1, x2 no_work_pending: kernel_exit 0 diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index e18c48cb6db1..fde59c1139a9 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs) restore_saved_sigmask(); } -asmlinkage void do_notify_resume(struct pt_regs *regs, - unsigned int thread_flags) +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs, + unsigned int thread_flags) { - if (thread_flags & _TIF_SIGPENDING) - do_signal(regs); + do { + local_irq_enable(); - if (thread_flags & _TIF_NOTIFY_RESUME) { - clear_thread_flag(TIF_NOTIFY_RESUME); - tracehook_notify_resume(regs); - } + if (thread_flags & _TIF_NEED_RESCHED) + schedule(); + + if (thread_flags & _TIF_SIGPENDING) + do_signal(regs); + + if (thread_flags & _TIF_NOTIFY_RESUME) { + clear_thread_flag(TIF_NOTIFY_RESUME); + tracehook_notify_resume(regs); + } + + if (thread_flags & _TIF_FOREIGN_FPSTATE) + fpsimd_restore_current_state(); + + local_irq_disable(); - if (thread_flags & _TIF_FOREIGN_FPSTATE) - fpsimd_restore_current_state(); + thread_flags = READ_ONCE(current_thread_info()->flags) & + _TIF_WORK_MASK; + } while (thread_flags); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
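The control flow this patch moves into C -- handle each pending work flag with interrupts enabled, then re-read the flags word and loop until no work bits remain -- can be modeled in plain userspace C. This is a sketch of the loop's logic only; the M_* bits are invented stand-ins for the real _TIF_* flags, and the "incoming" array stands in for work that arrives (via interrupts) while earlier work is being handled:

```c
#include <stdint.h>

/* Illustrative stand-ins for the _TIF_* work bits the patch handles. */
#define M_NEED_RESCHED   (1u << 0)
#define M_SIGPENDING     (1u << 1)
#define M_NOTIFY_RESUME  (1u << 2)
#define M_FOREIGN_FP     (1u << 3)
#define M_WORK_MASK      (M_NEED_RESCHED | M_SIGPENDING | \
                          M_NOTIFY_RESUME | M_FOREIGN_FP)

/* Model of prepare_exit_to_usermode(): handle every pending work bit,
 * merge in any work that arrived meanwhile, then re-read the flags
 * word (as the real code does with READ_ONCE) and repeat until no
 * work bits remain.  Returns the number of passes taken. */
int model_exit_loop(uint32_t *flags, const uint32_t *incoming, int n)
{
    int pass = 0;
    uint32_t work = *flags & M_WORK_MASK;
    while (work) {
        *flags &= ~work;                 /* "handle" each pending item */
        if (pass < n)
            *flags |= incoming[pass];    /* new work arriving mid-loop */
        pass++;
        work = *flags & M_WORK_MASK;     /* re-read and loop if needed */
    }
    return pass;
}
```

The point of structuring it this way, as the commit message says, is that any work item can dirty the flags again and the loop still catches it before the final return to userspace.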
* [PATCH v7 09/11] arch/arm64: enable task isolation functionality 2015-09-28 15:17 ` Chris Metcalf @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel, linux-arm-kernel Cc: Chris Metcalf We need to call task_isolation_enter() from prepare_exit_to_usermode(), so that we can both ensure we do it last before returning to userspace, and we also are able to re-run signal handling, etc., if something occurs while task_isolation_enter() has interrupts enabled. To do this we add _TIF_NOHZ to the _TIF_WORK_MASK if we have CONFIG_TASK_ISOLATION enabled, which brings us into prepare_exit_to_usermode() on all return to userspace. But we don't put _TIF_NOHZ in the flags that we use to loop back and recheck, since we don't need to loop back only because the flag is set. Instead we unconditionally call task_isolation_enter() at the end of the loop if any other work is done. To make the assembly code continue to be as optimized as before, we renumber the _TIF flags so that both _TIF_WORK_MASK and _TIF_SYSCALL_WORK still have contiguous runs of bits in the immediate operand for the "and" instruction, as required by the ARM64 ISA. Since TIF_NOHZ is in both masks, it must be the middle bit in the contiguous run that starts with the _TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits. We tweak syscall_trace_enter() slightly to carry the "flags" value from current_thread_info()->flags for each of the tests, rather than doing a volatile read from memory for each one. This avoids a small overhead for each test, and in particular avoids that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled. 
Also, we have to add an explicit check for STRICT mode in do_mem_abort() to handle the case of page faults, since arm64 does not use the exception_enter() mechanism. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------ arch/arm64/kernel/ptrace.c | 10 ++++++++-- arch/arm64/kernel/signal.c | 6 +++++- arch/arm64/mm/fault.c | 8 ++++++++ 4 files changed, 33 insertions(+), 9 deletions(-) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index dcd06d18a42a..4c36c4ee3528 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -101,11 +101,11 @@ static inline struct thread_info *current_thread_info(void) #define TIF_NEED_RESCHED 1 #define TIF_NOTIFY_RESUME 2 /* callback before returning to user */ #define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */ -#define TIF_NOHZ 7 -#define TIF_SYSCALL_TRACE 8 -#define TIF_SYSCALL_AUDIT 9 -#define TIF_SYSCALL_TRACEPOINT 10 -#define TIF_SECCOMP 11 +#define TIF_NOHZ 4 +#define TIF_SYSCALL_TRACE 5 +#define TIF_SYSCALL_AUDIT 6 +#define TIF_SYSCALL_TRACEPOINT 7 +#define TIF_SECCOMP 8 #define TIF_MEMDIE 18 /* is terminating due to OOM killer */ #define TIF_FREEZE 19 #define TIF_RESTORE_SIGMASK 20 @@ -124,9 +124,15 @@ static inline struct thread_info *current_thread_info(void) #define _TIF_SECCOMP (1 << TIF_SECCOMP) #define _TIF_32BIT (1 << TIF_32BIT) -#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ +#define _TIF_WORK_LOOP_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE) +#ifdef CONFIG_TASK_ISOLATION +# define _TIF_WORK_MASK (_TIF_WORK_LOOP_MASK | _TIF_NOHZ) +#else +# define _TIF_WORK_MASK _TIF_WORK_LOOP_MASK +#endif + #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \ _TIF_NOHZ) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 
1971f491bb90..9113789e9486 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { + unsigned long work = ACCESS_ONCE(current_thread_info()->flags); + /* Do the secure computing check first; failures should be fast. */ if (secure_computing() == -1) return -1; - if (test_thread_flag(TIF_SYSCALL_TRACE)) + if ((work & _TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->syscallno); + + if (work & _TIF_SYSCALL_TRACE) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); - if (test_thread_flag(TIF_SYSCALL_TRACEPOINT)) + if (work & _TIF_SYSCALL_TRACEPOINT) trace_sys_enter(regs, regs->syscallno); audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1], diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index fde59c1139a9..def9166eac9e 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -25,6 +25,7 @@ #include <linux/uaccess.h> #include <linux/tracehook.h> #include <linux/ratelimit.h> +#include <linux/isolation.h> #include <asm/debug-monitors.h> #include <asm/elf.h> @@ -419,10 +420,13 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs, if (thread_flags & _TIF_FOREIGN_FPSTATE) fpsimd_restore_current_state(); + if (task_isolation_enabled()) + task_isolation_enter(); + local_irq_disable(); thread_flags = READ_ONCE(current_thread_info()->flags) & - _TIF_WORK_MASK; + _TIF_WORK_LOOP_MASK; } while (thread_flags); } diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index aba9ead1384c..01c9ae336887 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -29,6 +29,7 @@ #include <linux/sched.h> #include <linux/highmem.h> #include 
<linux/perf_event.h> +#include <linux/isolation.h> #include <asm/cpufeature.h> #include <asm/exception.h> @@ -465,6 +466,13 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr, const struct fault_info *inf = fault_info + (esr & 63); struct siginfo info; + /* We don't use exception_enter(), so we check strict isolation here. */ + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && + test_thread_flag(TIF_NOHZ) && + task_isolation_strict() && + user_mode(regs)) + task_isolation_exception(); + if (!inf->fn(addr, esr, regs)) return; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
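The commit message's constraint -- that _TIF_WORK_MASK and _TIF_SYSCALL_WORK must each be a contiguous run of bits so they fit the immediate operand of an ARM64 "and" instruction -- is easy to sanity-check. Below is a sketch using the renumbered bit positions from the patch; note that the real ARM64 logical-immediate encoding also accepts rotated and replicated patterns, a contiguous run being simply the form the patch relies on:

```c
#include <stdint.h>

/* True if v is a single contiguous run of set bits, the property the
 * commit message wants for masks used as ARM64 "and" immediates. */
int is_contiguous_run(uint32_t v)
{
    if (v == 0)
        return 0;
    while (!(v & 1))
        v >>= 1;                 /* strip trailing zeros */
    return (v & (v + 1)) == 0;   /* remaining bits form a solid run */
}

/* Bit numbers as renumbered by the patch (TIF_SIGPENDING stays 0). */
enum {
    TIF_SIGPENDING = 0, TIF_NEED_RESCHED = 1, TIF_NOTIFY_RESUME = 2,
    TIF_FOREIGN_FPSTATE = 3, TIF_NOHZ = 4, TIF_SYSCALL_TRACE = 5,
    TIF_SYSCALL_AUDIT = 6, TIF_SYSCALL_TRACEPOINT = 7, TIF_SECCOMP = 8,
};

#define WORK_MASK    ((1u << TIF_SIGPENDING) | (1u << TIF_NEED_RESCHED) | \
                      (1u << TIF_NOTIFY_RESUME) | (1u << TIF_FOREIGN_FPSTATE) | \
                      (1u << TIF_NOHZ))
#define SYSCALL_WORK ((1u << TIF_NOHZ) | (1u << TIF_SYSCALL_TRACE) | \
                      (1u << TIF_SYSCALL_AUDIT) | (1u << TIF_SYSCALL_TRACEPOINT) | \
                      (1u << TIF_SECCOMP))
```

With this numbering, TIF_NOHZ (bit 4) is the shared middle bit: WORK_MASK covers bits 0-4, SYSCALL_WORK covers bits 4-8, and their union is itself one contiguous run.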
* [PATCH v7 10/11] arch/tile: adopt prepare_exit_to_usermode() model from x86 2015-09-28 15:17 ` Chris Metcalf ` (9 preceding siblings ...) (?) @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf This change is a prerequisite change for TASK_ISOLATION but also stands on its own for readability and maintainability. The existing tile do_work_pending() was called in a loop from assembly on the slow path; this change moves the loop into C code as well. For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C"). This change exposes a pre-existing bug on the older tilepro platform; the singlestep processing is done last, but on tilepro (unlike tilegx) we enable interrupts while doing that processing, so we could in theory miss a signal or other asynchronous event. A future change could fix this by breaking the singlestep work into a "prepare" step done in the main loop, and a "trigger" step done after exiting the loop. Since this change is intended as purely a restructuring change, we call out the bug explicitly now, but don't yet fix it. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/thread_info.h | 8 +++- arch/tile/kernel/intvec_32.S | 46 +++++++-------------- arch/tile/kernel/intvec_64.S | 49 +++++++---------------- arch/tile/kernel/process.c | 79 +++++++++++++++++++------------------ 5 files changed, 77 insertions(+), 107 deletions(-) diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h index 139dfdee0134..0684e88aacd8 100644 --- a/arch/tile/include/asm/processor.h +++ b/arch/tile/include/asm/processor.h @@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task) /* Nothing for now */ } -extern int do_work_pending(struct pt_regs *regs, u32 flags); +extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags); /* diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h index dc1fb28d9636..4b7cef9e94e0 100644 --- a/arch/tile/include/asm/thread_info.h +++ b/arch/tile/include/asm/thread_info.h @@ -140,10 +140,14 @@ extern void _cpu_idle(void); #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) #define _TIF_NOHZ (1<<TIF_NOHZ) +/* Work to do as we loop to exit to user space. */ +#define _TIF_WORK_MASK \ + (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \ + _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME) + /* Work to do on any return to user space. */ #define _TIF_ALLWORK_MASK \ - (_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \ - _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ) + (_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ) /* Work to do at syscall entry. */ #define _TIF_SYSCALL_ENTRY_WORK \ diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S index fbbe2ea882ea..33d48812872a 100644 --- a/arch/tile/kernel/intvec_32.S +++ b/arch/tile/kernel/intvec_32.S @@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return) FEEDBACK_REENTER(interrupt_return) /* - * Use r33 to hold whether we have already loaded the callee-saves - * into ptregs. 
We don't want to do it twice in this loop, since - * then we'd clobber whatever changes are made by ptrace, etc. - * Get base of stack in r32. - */ - { - GET_THREAD_INFO(r32) - movei r33, 0 - } - -.Lretry_work_pending: - /* * Disable interrupts so as to make sure we don't * miss an interrupt that sets any of the thread flags (like * need_resched or sigpending) between sampling and the iret. @@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return) IRQ_DISABLE(r20, r21) TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */ - - /* Check to see if there is any work to do before returning to user. */ + /* + * See if there are any work items (including single-shot items) + * to do. If so, save the callee-save registers to pt_regs + * and then dispatch to C code. + */ + GET_THREAD_INFO(r21) { - addi r29, r32, THREAD_INFO_FLAGS_OFFSET - moveli r1, lo16(_TIF_ALLWORK_MASK) + addi r22, r21, THREAD_INFO_FLAGS_OFFSET + moveli r20, lo16(_TIF_ALLWORK_MASK) } { - lw r29, r29 - auli r1, r1, ha16(_TIF_ALLWORK_MASK) + lw r22, r22 + auli r20, r20, ha16(_TIF_ALLWORK_MASK) } - and r1, r29, r1 - bzt r1, .Lrestore_all - - /* - * Make sure we have all the registers saved for signal - * handling, notify-resume, or single-step. Call out to C - * code to figure out exactly what we need to do for each flag bit, - * then if necessary, reload the flags and recheck. 
- */ + and r1, r22, r20 { PTREGS_PTR(r0, PTREGS_OFFSET_BASE) - bnz r33, 1f + bzt r1, .Lrestore_all } push_extra_callee_saves r0 - movei r33, 1 -1: jal do_work_pending - bnz r0, .Lretry_work_pending + jal prepare_exit_to_usermode /* * In the NMI case we @@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread) FEEDBACK_REENTER(ret_from_kernel_thread) { movei r30, 0 /* not an NMI */ - j .Lresume_userspace /* jump into middle of interrupt_return */ + j interrupt_return } STD_ENDPROC(ret_from_kernel_thread) diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S index 58964d209d4d..a41c994ce237 100644 --- a/arch/tile/kernel/intvec_64.S +++ b/arch/tile/kernel/intvec_64.S @@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return) FEEDBACK_REENTER(interrupt_return) /* - * Use r33 to hold whether we have already loaded the callee-saves - * into ptregs. We don't want to do it twice in this loop, since - * then we'd clobber whatever changes are made by ptrace, etc. - */ - { - movei r33, 0 - move r32, sp - } - - /* Get base of stack in r32. */ - EXTRACT_THREAD_INFO(r32) - -.Lretry_work_pending: - /* * Disable interrupts so as to make sure we don't * miss an interrupt that sets any of the thread flags (like * need_resched or sigpending) between sampling and the iret. @@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return) IRQ_DISABLE(r20, r21) TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */ - - /* Check to see if there is any work to do before returning to user. */ + /* + * See if there are any work items (including single-shot items) + * to do. If so, save the callee-save registers to pt_regs + * and then dispatch to C code. 
+ */ + move r21, sp + EXTRACT_THREAD_INFO(r21) { - addi r29, r32, THREAD_INFO_FLAGS_OFFSET - moveli r1, hw1_last(_TIF_ALLWORK_MASK) + addi r22, r21, THREAD_INFO_FLAGS_OFFSET + moveli r20, hw1_last(_TIF_ALLWORK_MASK) } { - ld r29, r29 - shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK) + ld r22, r22 + shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK) } - and r1, r29, r1 - beqzt r1, .Lrestore_all - - /* - * Make sure we have all the registers saved for signal - * handling or notify-resume. Call out to C code to figure out - * exactly what we need to do for each flag bit, then if - * necessary, reload the flags and recheck. - */ + and r1, r22, r20 { PTREGS_PTR(r0, PTREGS_OFFSET_BASE) - bnez r33, 1f + beqzt r1, .Lrestore_all } push_extra_callee_saves r0 - movei r33, 1 -1: jal do_work_pending - bnez r0, .Lretry_work_pending + jal prepare_exit_to_usermode /* * In the NMI case we @@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread) FEEDBACK_REENTER(ret_from_kernel_thread) { movei r30, 0 /* not an NMI */ - j .Lresume_userspace /* jump into middle of interrupt_return */ + j interrupt_return } STD_ENDPROC(ret_from_kernel_thread) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index 7d5769310bef..b5f30d376ce1 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev, /* * This routine is called on return from interrupt if any of the - * TIF_WORK_MASK flags are set in thread_info->flags. It is - * entered with interrupts disabled so we don't miss an event - * that modified the thread_info flags. If any flag is set, we - * handle it and return, and the calling assembly code will - * re-disable interrupts, reload the thread flags, and call back - * if more flags need to be handled. - * - * We return whether we need to check the thread_info flags again - * or not. 
Note that we don't clear TIF_SINGLESTEP here, so it's - * important that it be tested last, and then claim that we don't - * need to recheck the flags. + * TIF_ALLWORK_MASK flags are set in thread_info->flags. It is + * entered with interrupts disabled so we don't miss an event that + * modified the thread_info flags. We loop until all the tested flags + * are clear. Note that the function is called on certain conditions + * that are not listed in the loop condition here (e.g. SINGLESTEP) + * which guarantees we will do those things once, and redo them if any + * of the other work items is re-done, but won't continue looping if + * all the other work is done. */ -int do_work_pending(struct pt_regs *regs, u32 thread_info_flags) +void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags) { - /* If we enter in kernel mode, do nothing and exit the caller loop. */ - if (!user_mode(regs)) - return 0; + if (WARN_ON(!user_mode(regs))) + return; - user_exit(); + do { + local_irq_enable(); - /* Enable interrupts; they are disabled again on return to caller. 
*/ - local_irq_enable(); + if (thread_info_flags & _TIF_NEED_RESCHED) + schedule(); - if (thread_info_flags & _TIF_NEED_RESCHED) { - schedule(); - return 1; - } #if CHIP_HAS_TILE_DMA() - if (thread_info_flags & _TIF_ASYNC_TLB) { - do_async_page_fault(regs); - return 1; - } + if (thread_info_flags & _TIF_ASYNC_TLB) + do_async_page_fault(regs); #endif - if (thread_info_flags & _TIF_SIGPENDING) { - do_signal(regs); - return 1; - } - if (thread_info_flags & _TIF_NOTIFY_RESUME) { - clear_thread_flag(TIF_NOTIFY_RESUME); - tracehook_notify_resume(regs); - return 1; - } - if (thread_info_flags & _TIF_SINGLESTEP) + + if (thread_info_flags & _TIF_SIGPENDING) + do_signal(regs); + + if (thread_info_flags & _TIF_NOTIFY_RESUME) { + clear_thread_flag(TIF_NOTIFY_RESUME); + tracehook_notify_resume(regs); + } + + local_irq_disable(); + thread_info_flags = READ_ONCE(current_thread_info()->flags); + + } while (thread_info_flags & _TIF_WORK_MASK); + + if (thread_info_flags & _TIF_SINGLESTEP) { single_step_once(regs); +#ifndef __tilegx__ + /* + * FIXME: on tilepro, since we enable interrupts in + * this routine, it's possible that we miss a signal + * or other asynchronous event. + */ + local_irq_disable(); +#endif + } user_enter(); - - return 0; } unsigned long get_wchan(struct task_struct *p) -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v7 11/11] arch/tile: enable task isolation functionality 2015-09-28 15:17 ` Chris Metcalf ` (10 preceding siblings ...) (?) @ 2015-09-28 15:17 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-09-28 15:17 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf We add the necessary call to task_isolation_enter() in the prepare_exit_to_usermode() routine. We already unconditionally call into this routine if TIF_NOHZ is set, since that's where we do the user_enter() call. In addition, we add an overriding task_isolation_wait() call that runs a nap instruction while waiting for an interrupt, to make the task_isolation_enter() loop run in a lower-power state. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 13 +++++++++++++ arch/tile/kernel/ptrace.c | 3 +++ arch/tile/mm/homecache.c | 5 ++++- 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index b5f30d376ce1..28aa0f8b45ef 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -29,6 +29,7 @@ #include <linux/signal.h> #include <linux/delay.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/stack.h> #include <asm/switch_to.h> #include <asm/homecache.h> @@ -70,6 +71,15 @@ void arch_cpu_idle(void) _cpu_idle(); } +#ifdef CONFIG_TASK_ISOLATION +void task_isolation_wait(void) +{ + set_current_state(TASK_INTERRUPTIBLE); + _cpu_idle(); + set_current_state(TASK_RUNNING); +} +#endif + /* * Release a thread_info structure */ @@ -495,6 +505,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags) tracehook_notify_resume(regs); } + if (task_isolation_enabled()) + 
task_isolation_enter(); + local_irq_disable(); thread_info_flags = READ_ONCE(current_thread_info()->flags); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index bdc126faf741..04a7a6bf7d0a 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -265,6 +265,9 @@ int do_syscall_trace_enter(struct pt_regs *regs) if (secure_computing() == -1) return -1; + if ((work & _TIF_NOHZ) && task_isolation_strict()) + task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]); + if (work & _TIF_SYSCALL_TRACE) { if (tracehook_report_syscall_entry(regs)) regs->regs[TREG_SYSCALL_NR] = -1; diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..a79325113105 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/isolation.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + task_isolation_debug(cpu); + } } /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-09-28 15:17 ` Chris Metcalf @ 2015-10-20 20:35 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:35 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf This email discusses in detail the changes for v8; please see older versions of the cover letter for details about older versions. v8: The biggest difference in this version is, at Thomas Gleixner's suggestion, I removed the code that busy-waits until there are no scheduler-tick timer events queued. Instead, we now test for higher-level properties when attempting to return to userspace. We check if the core believes it has stopped the scheduler tick (which handles checking for scheduler contention from other tasks, RCU usage of the cpu, posix cpu timers, perf, etc), and if it hasn't, we request that the current process be rescheduled. In addition, we check if there are per-cpu lru pages to be drained, and we check if the vmstat worker has been quiesced. The structure is pretty clean so we can add additional tests as needed there as well. One nice aspect of this revised structure is that if the user actually requests a signal from a timer (for example), we will now return to userspace and let the program run. Of course it may get bombed with incremental timer ticks if the timer can't be programmed to the whole time interval in one step, but it still feels more correct this way than holding the process in the kernel until the user-requested timer expires. 
At Andy Lutomirski's suggestion, we separate out from the previous task_isolation_enter() a separate task_isolation_ready() test that can be done at the same time as we test the TIF_xxx flags, with interrupts disabled, so we can guarantee that the conditions we test for are still true when we return to userspace. To accomplish this we break out a new vmstat_idle() function that checks if the vmstat subsystem is quiesced on this core. Similarly, we factor out an lru_add_drain_needed() function from where it used to be in lru_add_drain_all(). Both of these "check" functions can now be called from task_isolation_ready() with interrupts disabled. Also at Andy's suggestion (and aligning with how I had done things previously in the Tilera private fork), the prctl() to enable task isolation will now fail with EINVAL if you attempt to enable task-isolation mode when your affinity does not lock you to a single core, or if that core is not a nohz_full core. We move the "strict" syscall test to just before SECCOMP instead of just after. It's not particularly clear that one is better than the other abstractly, and on a couple of the supported platforms (x86, tile) it makes the code structure work out better because the user_enter() can be done at the same time as the test for strict mode. The integration with context_tracking has been completely dropped; discussing with Andy showed that there are only a few exception sites that need strict-mode checking (the typical one is page faults that don't raise signals) so just putting the checks in the relevant functions feels cleaner than trying to hijack the exception_enter/exception_exit paths, which are being removed for x86 in any case. The task_isolation_exception() hook now takes full printf format arguments, so that we can generate a much more useful report as to why we are killing the task. As a result, we also remove the dump_stack() call, whose only utility was pointing the finger at which exception function had triggered. 
Rather than automatically disabling the 1 Hz maximum scheduler deferment for task-isolation tasks, we now require the user to specify a boot flag ("debug_1hz_tick") to do this. The boot flag allows us to test the case where all the 1 Hz updating subsystems have been fixed before that work actually is finished. An architecture-specific fix is included in this patch series for the tile architecture; I will push it through the tile tree (along with the tile prepare_exit_to_usermode restructuring) if there are no concerns. At issue is that we end up with one gratuitous timer tick when we are shutting down the timer; by setting up the set_state_oneshot_stopped function pointer callback for the tile tick timer we can avoid this problem. (Thomas, I'd particularly appreciate your ack on this fix, which is number 13 out of 14 in this patch series.) Rebased to v4.3-rc6 to pick up the fix for vmstat to properly use schedule_delayed_work_on(), since I was hitting a VM_BUG_ON without the fix (which I separately tracked down - oh well). 
v7: switch to architecture hooks for task_isolation_enter add an RCU_LOCKDEP_WARN() (Andy Lutomirski) rebased to v4.3-rc1 v6: restructured to be a "task_isolation" mode not a "cpu_isolated" mode (Frederic) v5: rebased on kernel v4.2-rc3 converted to use CONFIG_CPU_ISOLATED and separate .c and .h files incorporates Christoph Lameter's quiet_vmstat() call v4: rebased on kernel v4.2-rc1 added support for detecting CPU_ISOLATED_STRICT syscalls on arm64 v3: remove dependency on cpu_idle subsystem (Thomas Gleixner) use READ_ONCE instead of ACCESS_ONCE in tick_nohz_cpu_isolated_enter use seconds for console messages instead of jiffies (Thomas Gleixner) updated commit description for patch 5/5 v2: rename "dataplane" to "cpu_isolated" drop ksoftirqd suppression changes (believed no longer needed) merge previous "QUIESCE" functionality into baseline functionality explicitly track syscalls and exceptions for "STRICT" functionality allow configuring a signal to be delivered for STRICT mode failures move debug tracking to irq_enter(), not irq_exit() General summary: The existing nohz_full mode does a nice job of suppressing extraneous kernel interrupts for cores that desire it. However, there is a need for a more deterministic mode that rigorously disallows kernel interrupts, even at a higher cost in user/kernel transition time: for example, high-speed networking applications running userspace drivers that will drop packets if they are ever interrupted. These changes attempt to provide an initial draft of such a framework; the changes do not add any overhead to the usual non-nohz_full mode, and only very small overhead to the typical nohz_full mode. The kernel must be built with CONFIG_TASK_ISOLATION to take advantage of this new mode. A prctl() option (PR_SET_TASK_ISOLATION) is added to control whether processes have requested these stricter semantics, and within that prctl() option we provide a number of different bits for more precise control. 
Additionally, we add a new command-line boot argument to facilitate debugging where unexpected interrupts are being delivered from. Code that is conceptually similar has been in use in Tilera's Multicore Development Environment since 2008, known as Zero-Overhead Linux, and has seen wide adoption by a range of customers. This patch series represents the first serious attempt to upstream that functionality. Although the current state of the kernel isn't quite ready to run with absolutely no kernel interrupts, this patch series provides a way to make dynamic tradeoffs between avoiding kernel interrupts on the one hand, and making voluntary calls in and out of the kernel more expensive, for tasks that want it. The series is available at: git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane Chris Metcalf (13): vmstat: add vmstat_idle function lru_add_drain_all: factor out lru_add_drain_needed task_isolation: add initial support task_isolation: support PR_TASK_ISOLATION_STRICT mode task_isolation: provide strict mode configurable signal task_isolation: add debug boot flag nohz_full: allow disabling the 1Hz minimum tick at boot arch/x86: enable task isolation functionality arch/arm64: adopt prepare_exit_to_usermode() model from x86 arch/arm64: enable task isolation functionality arch/tile: adopt prepare_exit_to_usermode() model from x86 arch/tile: turn off timer tick for oneshot_stopped state arch/tile: enable task isolation functionality Christoph Lameter (1): vmstat: provide a function to quiet down the diff processing Documentation/kernel-parameters.txt | 7 ++ arch/arm64/include/asm/thread_info.h | 18 +++-- arch/arm64/kernel/entry.S | 6 +- arch/arm64/kernel/ptrace.c | 12 +++- arch/arm64/kernel/signal.c | 35 +++++++--- arch/arm64/mm/fault.c | 4 ++ arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/thread_info.h | 8 ++- arch/tile/kernel/intvec_32.S | 46 ++++--------- arch/tile/kernel/intvec_64.S | 49 +++++--------- 
arch/tile/kernel/process.c | 83 ++++++++++++----------- arch/tile/kernel/ptrace.c | 6 +- arch/tile/kernel/single_step.c | 5 ++ arch/tile/kernel/time.c | 1 + arch/tile/kernel/unaligned.c | 3 + arch/tile/mm/fault.c | 3 + arch/tile/mm/homecache.c | 5 +- arch/x86/entry/common.c | 10 ++- arch/x86/kernel/traps.c | 2 + arch/x86/mm/fault.c | 2 + include/linux/isolation.h | 61 +++++++++++++++++ include/linux/sched.h | 3 + include/linux/swap.h | 1 + include/linux/vmstat.h | 4 ++ include/uapi/linux/prctl.h | 8 +++ init/Kconfig | 20 ++++++ kernel/Makefile | 1 + kernel/irq_work.c | 5 +- kernel/isolation.c | 127 +++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 37 ++++++++++ kernel/signal.c | 13 ++++ kernel/smp.c | 4 ++ kernel/softirq.c | 7 ++ kernel/sys.c | 9 +++ mm/swap.c | 13 ++-- mm/vmstat.c | 24 +++++++ 36 files changed, 507 insertions(+), 137 deletions(-) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c -- 2.1.2 ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v8 01/14] vmstat: provide a function to quiet down the diff processing 2015-10-20 20:35 ` Chris Metcalf (?) @ 2015-10-20 20:35 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:35 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf From: Christoph Lameter <cl@linux.com> quiet_vmstat() can be called in anticipation of a OS "quiet" period where no tick processing should be triggered. quiet_vmstat() will fold all pending differentials into the global counters and disable the vmstat_worker processing. Note that the shepherd thread will continue scanning the differentials from another processor and will reenable the vmstat workers if it detects any changes. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 14 ++++++++++++++ 2 files changed, 16 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 82e7db7f7100..c013b8d8e434 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); +void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -272,6 +273,7 @@ static inline void __dec_zone_page_state(struct page *page, static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } +static inline void quiet_vmstat(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset 
*pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index fbf14485a049..a9c446353c7e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_struct *w) } /* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ +void quiet_vmstat(void) +{ + do { + if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + cancel_delayed_work(this_cpu_ptr(&vmstat_work)); + + } while (refresh_cpu_vm_stats()); +} + +/* * Check if the diffs for a certain cpu indicate that * an update is needed. */ -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 02/14] vmstat: add vmstat_idle function 2015-10-20 20:35 ` Chris Metcalf (?) (?) @ 2015-10-20 20:36 ` Chris Metcalf 2015-10-20 20:45 ` Christoph Lameter -1 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf This function checks to see if a vmstat worker is not running, and the vmstat diffs don't require an update. The function is called from the task-isolation code to see if we need to actually do some work to quiet vmstat. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 10 ++++++++++ 2 files changed, 12 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index c013b8d8e434..34e3b768e432 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -212,6 +212,7 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void quiet_vmstat(void); +bool vmstat_idle(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -274,6 +275,7 @@ static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } +static inline bool vmstat_idle(void) { return true; } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index a9c446353c7e..05fa1f0eefc8 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1431,6 +1431,16 @@ static bool need_update(int cpu) return false; } +/* + * Report on whether vmstat processing is quiesced on the core currently: + * no vmstat worker 
running and no vmstat updates to perform. + */ +bool vmstat_idle(void) +{ + int cpu = smp_processor_id(); + return cpumask_test_cpu(cpu, cpu_stat_off) && !need_update(cpu); +} + /* * Shepherd worker thread that checks the -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v8 02/14] vmstat: add vmstat_idle function 2015-10-20 20:36 ` [PATCH v8 02/14] vmstat: add vmstat_idle function Chris Metcalf @ 2015-10-20 20:45 ` Christoph Lameter 0 siblings, 0 replies; 340+ messages in thread From: Christoph Lameter @ 2015-10-20 20:45 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel On Tue, 20 Oct 2015, Chris Metcalf wrote: > This function checks to see if a vmstat worker is not running, > and the vmstat diffs don't require an update. The function is > called from the task-isolation code to see if we need to > actually do some work to quiet vmstat. Acked-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v8 03/14] lru_add_drain_all: factor out lru_add_drain_needed 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-mm, linux-kernel Cc: Chris Metcalf This per-cpu check was being done in the loop in lru_add_drain_all(), but having it be callable for a particular cpu is helpful for the task-isolation patches. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/swap.h | 1 + mm/swap.c | 13 +++++++++---- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7ba7dccaf0e7..66719610c9f5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -305,6 +305,7 @@ extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); +extern bool lru_add_drain_needed(int cpu); extern void lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); extern void deactivate_file_page(struct page *page); diff --git a/mm/swap.c b/mm/swap.c index 983f692a47fd..e21f3357cedd 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -854,6 +854,14 @@ void deactivate_file_page(struct page *page) } } +bool lru_add_drain_needed(int cpu) +{ + return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || + pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || + pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) || + need_activate_page_drain(cpu)); +} + void lru_add_drain(void) { lru_add_drain_cpu(get_cpu()); @@ -880,10 +888,7 @@ void lru_add_drain_all(void) for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); - if 
(pagevec_count(&per_cpu(lru_add_pvec, cpu)) || - pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || - pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) || - need_activate_page_drain(cpu)) { + if (lru_add_drain_needed(cpu)) { INIT_WORK(work, lru_add_drain_per_cpu); schedule_work_on(cpu, work); cpumask_set_cpu(cpu, &has_work); -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. Subsequent commits will add additional flags and additional semantics. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate nohz_full=CPULIST boot argument. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(). When the _ENABLE bit is set for a task, and it is returning to userspace on a nohz_full core, it calls the new task_isolation_ready() / task_isolation_enter() routines to take additional actions to help the task avoid being interrupted in the future. The task_isolation_ready() call plays an equivalent role to the TIF_xxx flags when returning to userspace, and should be checked in the loop check of the prepare_exit_to_usermode() routine or its architecture equivalent. 
It is called with interrupts disabled and inspects the kernel state to determine if it is safe to return into an isolated state. In particular, if it sees that the scheduler tick is still enabled, it sets the TIF_NEED_RESCHED bit to notify the scheduler to attempt to schedule a different task. Each time through the loop of TIF work to do, we call the new task_isolation_enter() routine, which takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). As a result of these tests on the "return to userspace" path, sys calls (and page faults, etc.) can be inordinately slow. However, this quiescing guarantees that no unexpected interrupts will occur, even if the application intentionally calls into the kernel. Separate patches that follow provide these changes for x86, arm64, and tile. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 38 ++++++++++++++++++++++ include/linux/sched.h | 3 ++ include/uapi/linux/prctl.h | 5 +++ init/Kconfig | 20 ++++++++++++ kernel/Makefile | 1 + kernel/isolation.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 9 ++++++ 7 files changed, 154 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..4bef90024924 --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,38 @@ +/* + * Task isolation related global functions + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <linux/tick.h> +#include <linux/prctl.h> + +#ifdef CONFIG_TASK_ISOLATION +extern int task_isolation_set(unsigned int flags); +static inline bool task_isolation_enabled(void) +{ + return tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE); +} + +extern bool _task_isolation_ready(void); +extern void _task_isolation_enter(void); + +static inline bool task_isolation_ready(void) +{ + return !task_isolation_enabled() || _task_isolation_ready(); +} + +static inline void task_isolation_enter(void) +{ + if (task_isolation_enabled()) + _task_isolation_enter(); +} + +#else +static inline bool task_isolation_enabled(void) { return false; } +static inline bool task_isolation_ready(void) { return true; } +static inline void task_isolation_enter(void) { } +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index b7b9501b41af..7a50f6904675 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1812,6 +1812,9 @@ struct task_struct { unsigned long task_state_change; #endif int pagefault_disabled; +#ifdef CONFIG_TASK_ISOLATION + unsigned int task_isolation_flags; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git 
a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..67224df4b559 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,9 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */ +#define PR_SET_TASK_ISOLATION 48 +#define PR_GET_TASK_ISOLATION 49 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c24b6f767bf0..4ff7f052059a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -787,6 +787,26 @@ config RCU_EXPEDITE_BOOT endmenu # "RCU Subsystem" +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL + help + Allow userspace processes to place themselves on nohz_full + cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate" + themselves from the kernel. On return to userspace, + isolated tasks will first arrange that no future kernel + activity will interrupt the task while the task is running + in userspace. This "hard" isolation from the kernel is + required for userspace tasks that are running hard real-time + tasks in userspace, such as a 10 Gbit network driver in userspace. + + Without this option, but with NO_HZ_FULL enabled, the kernel + will make a best-faith, "soft" effort to shield a single userspace + process from interrupts, but makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. 
+ config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index 53abf008ecb3..693a2ba35679 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_HAS_IOMEM) += memremap.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..9a73235db0bb --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,78 @@ +/* + * linux/kernel/isolation.c + * + * Implementation for task isolation. + * + * Distributed under GPLv2. + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/isolation.h> +#include <linux/syscalls.h> +#include "time/tick-sched.h" + +/* + * This routine controls whether we can enable task-isolation mode. + * The task must be affinitized to a single nohz_full core or we will + * return EINVAL. Although the application could later re-affinitize + * to a housekeeping core and lose task isolation semantics, this + * initial test should catch 99% of bugs with task placement prior to + * enabling task isolation. + */ +int task_isolation_set(unsigned int flags) +{ + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || + !tick_nohz_full_cpu(smp_processor_id())) + return -EINVAL; + + current->task_isolation_flags = flags; + return 0; +} + +/* + * In task isolation mode we try to return to userspace only after + * attempting to make sure we won't be interrupted again. To handle + * the periodic scheduler tick, we test to make sure that the tick is + * stopped, and if it isn't yet, we request a reschedule so that if + * another task needs to run to completion first, it can do so. + * Similarly, if any other subsystems require quiescing, we will need + * to do that before we return to userspace. 
+ */ +bool _task_isolation_ready(void) +{ + WARN_ON_ONCE(!irqs_disabled()); + + /* If we need to drain the LRU cache, we're not ready. */ + if (lru_add_drain_needed(smp_processor_id())) + return false; + + /* If vmstats need updating, we're not ready. */ + if (!vmstat_idle()) + return false; + + /* If the tick is running, request rescheduling; we're not ready. */ + if (!tick_nohz_tick_stopped()) { + set_tsk_need_resched(current); + return false; + } + + return true; +} + +/* + * Each time we try to prepare for return to userspace in a process + * with task isolation enabled, we run this code to quiesce whatever + * subsystems we can readily quiesce to avoid later interrupts. + */ +void _task_isolation_enter(void) +{ + WARN_ON_ONCE(irqs_disabled()); + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + /* Quieten the vmstat worker so it won't interrupt us. */ + quiet_vmstat(); +} diff --git a/kernel/sys.c b/kernel/sys.c index fa2f2f671a5c..f1b1d333f74d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -41,6 +41,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/isolation.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -2266,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_TASK_ISOLATION + case PR_SET_TASK_ISOLATION: + error = task_isolation_set(arg2); + break; + case PR_GET_TASK_ISOLATION: + error = me->task_isolation_flags; + break; +#endif default: error = -EINVAL; break; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 20:36 ` Chris Metcalf (?) @ 2015-10-20 20:56 ` Andy Lutomirski 2015-10-20 21:20 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-10-20 20:56 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > +/* > + * In task isolation mode we try to return to userspace only after > + * attempting to make sure we won't be interrupted again. To handle > + * the periodic scheduler tick, we test to make sure that the tick is > + * stopped, and if it isn't yet, we request a reschedule so that if > + * another task needs to run to completion first, it can do so. > + * Similarly, if any other subsystems require quiescing, we will need > + * to do that before we return to userspace. > + */ > +bool _task_isolation_ready(void) > +{ > + WARN_ON_ONCE(!irqs_disabled()); > + > + /* If we need to drain the LRU cache, we're not ready. */ > + if (lru_add_drain_needed(smp_processor_id())) > + return false; > + > + /* If vmstats need updating, we're not ready. */ > + if (!vmstat_idle()) > + return false; > + > + /* If the tick is running, request rescheduling; we're not ready. */ > + if (!tick_nohz_tick_stopped()) { > + set_tsk_need_resched(current); > + return false; > + } > + > + return true; > +} I still don't get why this is a loop. I would argue that this should simply drain the LRU, quiet vmstat, and return. If the tick isn't stopped, then there's a reason why it's not stopped (which may involve having SCHED_OTHER tasks around, in which case user code shouldn't do that or there should simply be a requirement that isolation requires a real-time scheduler class). 
BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2015-10-20 21:20 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 21:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 10/20/2015 04:56 PM, Andy Lutomirski wrote: > On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> +/* >> + * In task isolation mode we try to return to userspace only after >> + * attempting to make sure we won't be interrupted again. To handle >> + * the periodic scheduler tick, we test to make sure that the tick is >> + * stopped, and if it isn't yet, we request a reschedule so that if >> + * another task needs to run to completion first, it can do so. >> + * Similarly, if any other subsystems require quiescing, we will need >> + * to do that before we return to userspace. >> + */ >> +bool _task_isolation_ready(void) >> +{ >> + WARN_ON_ONCE(!irqs_disabled()); >> + >> + /* If we need to drain the LRU cache, we're not ready. */ >> + if (lru_add_drain_needed(smp_processor_id())) >> + return false; >> + >> + /* If vmstats need updating, we're not ready. */ >> + if (!vmstat_idle()) >> + return false; >> + >> + /* If the tick is running, request rescheduling; we're not ready. */ >> + if (!tick_nohz_tick_stopped()) { >> + set_tsk_need_resched(current); >> + return false; >> + } >> + >> + return true; >> +} > I still don't get why this is a loop. You mean, why is this code called from prepare_exit_to_userspace() in the loop, instead of after the loop? It's because the actual functions that clean up the LRU, vmstat worker, etc., may need interrupts enabled, may reschedule internally, etc. (refresh_cpu_vm_stats() calls cond_resched(), for example.) 
Even more importantly, we rely on rescheduling to take care of the fact that the scheduler tick may still be running, and therefore loop back to the schedule() call that's run when TIF_NEED_RESCHED gets set. And so, since interrupts and scheduling can happen, this code needs to run in a loop and retest, just like the existing tests for signal dispatch, need_resched, etc. > I would argue that this should simply drain the LRU, quiet vmstat, and > return. If the tick isn't stopped, then there's a reason why it's not > stopped (which may involve having SCHED_OTHER tasks around, in which > case user code shouldn't do that or there should simply be a > requirement that isolation requires a real-time scheduler class). Sure, the tick not being stopped has a reason for not being stopped, but if it's not yet stopped, we need to schedule out and wait for that to happen. A real-time scheduler class won't completely take care of this as you still may have issues like RCU needing the cpu or any of the other cases in can_stop_full_tick(). > BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? So a scheduler class is an interesting idea certainly, although not one I know immediately how to implement. I'm not sure whether it makes sense to require a user be root or have a suitable rtprio rlimit, but perhaps so. The nice thing about the current patch series is that you can affinitize yourself to a nohz_full core and declare that you want to run task-isolated, and none of that requires root nor really is there a reason it should. I guess you could make SCHED_ISOLATED like SCHED_BATCH and perhaps therefore allow non-root users to switch to it? In any case it would have to be true that we would still be doing all the other tests we do now, even if we could count on the scheduler to take care of only trying to run it when there were no other runnable processes. So it would certainly add complexity. I'm not sure how to evaluate the utility. 
-- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2015-10-20 21:26 ` Andy Lutomirski 0 siblings, 0 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-10-20 21:26 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Oct 20, 2015 at 2:20 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/20/2015 04:56 PM, Andy Lutomirski wrote: >> >> On Tue, Oct 20, 2015 at 1:36 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >>> >>> +/* >>> + * In task isolation mode we try to return to userspace only after >>> + * attempting to make sure we won't be interrupted again. To handle >>> + * the periodic scheduler tick, we test to make sure that the tick is >>> + * stopped, and if it isn't yet, we request a reschedule so that if >>> + * another task needs to run to completion first, it can do so. >>> + * Similarly, if any other subsystems require quiescing, we will need >>> + * to do that before we return to userspace. >>> + */ >>> +bool _task_isolation_ready(void) >>> +{ >>> + WARN_ON_ONCE(!irqs_disabled()); >>> + >>> + /* If we need to drain the LRU cache, we're not ready. */ >>> + if (lru_add_drain_needed(smp_processor_id())) >>> + return false; >>> + >>> + /* If vmstats need updating, we're not ready. */ >>> + if (!vmstat_idle()) >>> + return false; >>> + >>> + /* If the tick is running, request rescheduling; we're not ready. >>> */ >>> + if (!tick_nohz_tick_stopped()) { >>> + set_tsk_need_resched(current); >>> + return false; >>> + } >>> + >>> + return true; >>> +} >> >> I still don't get why this is a loop. > > > You mean, why is this code called from prepare_exit_to_userspace() > in the loop, instead of after the loop? 
It's because the actual functions > that clean up the LRU, vmstat worker, etc., may need interrupts enabled, > may reschedule internally, etc. (refresh_cpu_vm_stats() calls > cond_resched(), for example.) Yuck. I guess that's a reasonable argument, although it could also be fixed. > Even more importantly, we rely on > rescheduling to take care of the fact that the scheduler tick may still > be running, and therefore loop back to the schedule() call that's run > when TIF_NEED_RESCHED gets set. This just seems like a mis-design. We don't know why the scheduler tick is on, so we're just going to reschedule until the problem goes away? > >> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? > > > So a scheduler class is an interesting idea certainly, although not > one I know immediately how to implement. I'm not sure whether > it makes sense to require a user be root or have a suitable rtprio > rlimit, but perhaps so. The nice thing about the current patch > series is that you can affinitize yourself to a nohz_full core and > declare that you want to run task-isolated, and none of that > requires root nor really is there a reason it should. Your patches more or less implement "don't run me unless I'm isolated". A scheduler class would be more like "isolate me (and maybe make me super high priority so it actually happens)". I'm not a scheduler person, so I don't know. But "don't run me unless I'm isolated" seems like a design that will, at best, only ever work by dumb luck. You have to disable migration, avoid other runnable tasks, hope that the kernel keeps working the way it did when you wrote the patch, hope you continue to get lucky enough that you ever get to user mode in the first place, etc. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2015-10-21 0:29 ` Steven Rostedt 0 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-10-21 0:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, 20 Oct 2015 14:26:34 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > I'm not a scheduler person, so I don't know. But "don't run me unless > I'm isolated" seems like a design that will, at best, only ever work > by dumb luck. You have to disable migration, avoid other runnable > tasks, hope that the kernel keeps working the way it did when you > wrote the patch, hope you continue to get lucky enough that you ever > get to user mode in the first place, etc. Since it only makes sense to run one isolated task per cpu (not more than one on the same CPU), I wonder if we should add a new interface for this, that would force everything else off the CPU that it requests. That is, you bind a task to a CPU, and then change it to SCHED_ISOLATED (or what not), and the kernel will force all other tasks off that CPU. Well, we would still have kernel threads, but that's a different matter. Also, doesn't RCU need to have a few ticks go by before it can safely disable itself from userspace? I recall something like that. Paul? -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 0:29 ` Steven Rostedt (?) @ 2015-10-26 20:19 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-26 20:19 UTC (permalink / raw) To: Steven Rostedt, Andy Lutomirski Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel Andy wrote: > Your patches more or less implement "don't run me unless I'm > isolated". A scheduler class would be more like "isolate me (and > maybe make me super high priority so it actually happens)". Steven wrote: > Since it only makes sense to run one isolated task per cpu (not more > than one on the same CPU), I wonder if we should add a new interface > for this, that would force everything else off the CPU that it > requests. That is, you bind a task to a CPU, and then change it to > SCHED_ISOLATED (or what not), and the kernel will force all other tasks > off that CPU. Frederic wrote: > I think you'll have to make sure the task can not be concurrently > reaffined to more CPUs. This may involve setting task_isolation_flags > under the runqueue lock and thus move that tiny part to the scheduler > code. And then we must forbid changing the affinity while the task has > the isolation flag, or deactivate the flag. These comments are all about the same high-level question, so I want to address it in this reply. The question is, should TASK_ISOLATION be "polite" or "aggressive"? The original design was "polite": it worked as long as no other thing on the system tried to mess with it. The suggestions above are for an "aggressive" design. The "polite" design basically tags a task as being interested in having the kernel help it out by staying away from it. It relies on running on a nohz_full cpu to keep scheduler ticks away from it. 
It relies on running on an isolcpus cpu to keep other processes from getting dynamically load-balanced onto it and messing it up. And, of course, it relies on the other applications and users running on the machine not to affinitize themselves onto its core and mess it up that way. But, as long as all those things are true, the kernel will try to help it out by never interrupting it. (And, it allows for the kernel to report when those expectations are violated.) The "aggressive" design would have an API that said "This is my core!". The kernel would enforce keeping other processes off the core. It would require nohz_full semantics on that core. It would lock the task to that core in some way that would override attempts to reset its sched_affinity. It would do whatever else was necessary to make that core unavailable to the rest of the system. Advantages of the "polite" design: - No special privileges required - As a result, no security issues to sort through (capabilities, etc.) - Therefore easy to use when running as an unprivileged user - Won't screw up the occasional kernel task that needs to run Advantages of the "aggressive" design: - Clearer that the application will get the task isolation it wants - More reasonable that it is enforcing kernel performance tweaks on the local core (e.g. flushing the per-cpu LRU cache) The "aggressive" design is certainly tempting, but there may be other negative consequences of this design: for example, if we need to run a usermode helper process as a result of some system call, we do want to ensure that it can run, and we need to allow it to be scheduled, even if it's just a regular scheduler class thing. The "polite" design allows the usermode helper to run and just waits until it's safe for the isolated task to return to userspace. Possibly we could arrange for a SCHED_ISOLATED class to allow that kind of behavior, though I'm not familiar enough with the scheduler code to say for sure. 
I think it's important that we're explicit about which of these two approaches feels like the more appropriate one. Possibly my Tilera background is part of which pushes me towards the "polite" design; we have a lot of cores, so they're a kind of trivial resource that we don't need to aggressively defend, and it's a more conservative design to enable task isolation only when all the relevant criteria have been met, rather than enforcing those criteria up front. I think if we adopt the "aggressive" model, it might likely make sense to express it as a scheduling policy, since it would include core scheduler changes such as denying other tasks the right to call sched_setaffinity() with an affinity that includes cores currently in use by SCHED_ISOLATED tasks. This would be something pretty deeply hooked into the scheduler and therefore might require some more substantial changes. In addition, of course, there's the cost of documenting yet another scheduler policy. In the "polite" model, we certainly could use a SCHED_ISOLATED scheduling policy (with static priority zero) to indicate task-isolation mode, rather than using prctl() to set a task_struct bit. I'm not sure how much it gains, though. It could allow the scheduler to detect that the only "runnable" task actually didn't want to be run, and switch briefly to the idle task, but since this would likely only be for a scheduler tick or two, the power advantages are pretty minimal, for a pretty reasonable additional piece of complexity both in the API (documenting a new scheduler class) and in the implementation (putting new requirements into the scheduler implementations). So I'm somewhat dubious, although willing to be pushed in that direction if that's the consensus. On balance I think it still feels to me like the original proposed direction (a "polite" task isolation mode with a prctl bit) feels better than the scheduler-based alternatives that have been proposed. 
-- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 0:29 ` Steven Rostedt (?) (?) @ 2015-10-26 21:13 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-26 21:13 UTC (permalink / raw) To: Steven Rostedt, Andy Lutomirski Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 10/20/2015 08:29 PM, Steven Rostedt wrote: > Also, doesn't RCU need to have a few ticks go by before it can safely > disable itself from userspace? I recall something like that. Paul? The current patch series supports that by testing tick_nohz_tick_stopped(), which internally only becomes true after tick_nohz_stop_sched_tick() manages to stop the tick, and it won't if rcu_needs_cpu() is true. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-20 21:26 ` Andy Lutomirski (?) (?) @ 2015-10-26 20:32 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-26 20:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 10/20/2015 05:26 PM, Andy Lutomirski wrote: >> Even more importantly, we rely on >> rescheduling to take care of the fact that the scheduler tick may still >> be running, and therefore loop back to the schedule() call that's run >> when TIF_NEED_RESCHED gets set. > This just seems like a mis-design. We don't know why the scheduler > tick is on, so we're just going to reschedule until the problem goes > away? See my previous email about polite vs aggressive design for more thoughts on this, but, yes. I'm not sure there's a way to do anything else, other than my proposal there to dig deep into the scheduler and allow it to switch to idle for a few tasks - but again, I'm just not sure the complexity is worth the runtime power savings. >>> BTW, should isolation just be a scheduler class (SCHED_ISOLATED)? >> >> So a scheduler class is an interesting idea certainly, although not >> one I know immediately how to implement. I'm not sure whether >> it makes sense to require a user be root or have a suitable rtprio >> rlimit, but perhaps so. The nice thing about the current patch >> series is that you can affinitize yourself to a nohz_full core and >> declare that you want to run task-isolated, and none of that >> requires root nor really is there a reason it should. > Your patches more or less implement "don't run me unless I'm > isolated". A scheduler class would be more like "isolate me (and > maybe make me super high priority so it actually happens)". 
> > I'm not a scheduler person, so I don't know. But "don't run me unless > I'm isolated" seems like a design that will, at best, only ever work > by dumb luck. You have to disable migration, avoid other runnable > tasks, hope that the kernel keeps working the way it did when you > wrote the patch, hope you continue to get lucky enough that you ever > get to user mode in the first place, etc. Could you explain the "dumb luck" characterization a bit more? You're definitely right that I need to test for isolcpus separately now that it's been decoupled from nohz_full again, so I will add that to the next release of the series. But the rest of it seems like things you just control for when you are running the application, and if you do it right, the application runs. If you don't (e.g. you intentionally schedule multiple processes on the same core), the app doesn't run, and you fix it in development. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2015-10-21 16:12 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-21 16:12 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: > diff --git a/kernel/isolation.c b/kernel/isolation.c > new file mode 100644 > index 000000000000..9a73235db0bb > --- /dev/null > +++ b/kernel/isolation.c > @@ -0,0 +1,78 @@ > +/* > + * linux/kernel/isolation.c > + * > + * Implementation for task isolation. > + * > + * Distributed under GPLv2. > + */ > + > +#include <linux/mm.h> > +#include <linux/swap.h> > +#include <linux/vmstat.h> > +#include <linux/isolation.h> > +#include <linux/syscalls.h> > +#include "time/tick-sched.h" > + > +/* > + * This routine controls whether we can enable task-isolation mode. > + * The task must be affinitized to a single nohz_full core or we will > + * return EINVAL. Although the application could later re-affinitize > + * to a housekeeping core and lose task isolation semantics, this > + * initial test should catch 99% of bugs with task placement prior to > + * enabling task isolation. > + */ > +int task_isolation_set(unsigned int flags) > +{ > + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || I think you'll have to make sure the task can not be concurrently reaffined to more CPUs. This may involve setting task_isolation_flags under the runqueue lock and thus move that tiny part to the scheduler code. And then we must forbid changing the affinity while the task has the isolation flag, or deactivate the flag. In any case this needs some synchronization. 
> + !tick_nohz_full_cpu(smp_processor_id())) > + return -EINVAL; > + > + current->task_isolation_flags = flags; > + return 0; > +} > + > +/* > + * In task isolation mode we try to return to userspace only after > + * attempting to make sure we won't be interrupted again. To handle > + * the periodic scheduler tick, we test to make sure that the tick is > + * stopped, and if it isn't yet, we request a reschedule so that if > + * another task needs to run to completion first, it can do so. > + * Similarly, if any other subsystems require quiescing, we will need > + * to do that before we return to userspace. > + */ > +bool _task_isolation_ready(void) > +{ > + WARN_ON_ONCE(!irqs_disabled()); > + > + /* If we need to drain the LRU cache, we're not ready. */ > + if (lru_add_drain_needed(smp_processor_id())) > + return false; > + > + /* If vmstats need updating, we're not ready. */ > + if (!vmstat_idle()) > + return false; > + > + /* If the tick is running, request rescheduling; we're not ready. */ > + if (!tick_nohz_tick_stopped()) { Note that this function tells whether the tick is in dynticks mode, which means the tick currently only runs on demand. But it's not necessarily completely stopped. I think we should rename that function and the field it refers to. > + set_tsk_need_resched(current); > + return false; > + } > + > + return true; > +} Thanks. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2015-10-21 16:12 ` Frederic Weisbecker @ 2015-10-27 16:40 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-27 16:40 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: > On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: >> +/* >> + * This routine controls whether we can enable task-isolation mode. >> + * The task must be affinitized to a single nohz_full core or we will >> + * return EINVAL. Although the application could later re-affinitize >> + * to a housekeeping core and lose task isolation semantics, this >> + * initial test should catch 99% of bugs with task placement prior to >> + * enabling task isolation. >> + */ >> +int task_isolation_set(unsigned int flags) >> +{ >> + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || > I think you'll have to make sure the task can not be concurrently reaffined > to more CPUs. This may involve setting task_isolation_flags under the runqueue > lock and thus move that tiny part to the scheduler code. And then we must forbid > changing the affinity while the task has the isolation flag, or deactivate the flag. > > In any case this needs some synchronization. Well, as the comment says, this is not intended as a hard guarantee. As written, it might race with a concurrent sched_setaffinity(), but then again, it also is totally OK as written for sched_setaffinity() to change it away after the prctl() is complete, so it's not necessary to do any explicit synchronization. This harks back again to the whole "polite vs aggressive" issue with how we envision task isolation. 
The "polite" model basically allows you to set up the conditions for task isolation to be useful, and then if they are useful, great! What you're suggesting here is a bit more of the "aggressive" model, where we actually fail sched_setaffinity() either for any cpumask after task isolation is set, or perhaps just for resetting it to housekeeping cores. (Note that we could in principle use PF_NO_SETAFFINITY to just hard fail all attempts to call sched_setaffinity once we enable task isolation, so we don't have to add more mechanism on that path.) I'm a little reluctant to ever fail sched_setaffinity() based on the task isolation status with the current "polite" model, since an unprivileged application can set up for task isolation, and then presumably no one can override it via sched_setaffinity() from another task. (I suppose you could do some kind of permissions-based thing where root can always override it, or some suitable capability, etc., but I feel like that gets complicated quickly, for little benefit.) The alternative you mention is that if the task is re-affinitized, it loses its task-isolation status, and that also seems like an unfortunate API, since if you are setting it with prctl(), it's really cleanest just to only be able to unset it with prctl() as well. I think given the current "polite" API, the only question is whether in fact *no* initial test is the best thing, or if an initial test (as introduced in the v8 version) is defensible just as a help for catching an obvious mistake in setting up your task isolation. I decided the advantage of catching the mistake was more important than the "API purity" of being 100% consistent in how we handled the interactions between affinity and isolation, but I am certainly open to argument on that one. Meanwhile I think it still feels like the v8 code is the best compromise. >> + /* If the tick is running, request rescheduling; we're not ready. 
*/ >> + if (!tick_nohz_tick_stopped()) { > Note that this function tells whether the tick is in dynticks mode, which means > the tick currently only run on-demand. But it's not necessarily completely stopped. I think in fact this is the semantics we want (and that people requested), e.g. if the user requests an alarm(), we may still be ticking even though tick_nohz_tick_stopped() is true, but that test is still the right condition to use to return to user space, since the user explicitly requested the alarm. > I think we should rename that function and the field it refers to. Sounds like a good idea. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2016-01-28 16:38 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2016-01-28 16:38 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote: > On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: > >On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: > >>+/* > >>+ * This routine controls whether we can enable task-isolation mode. > >>+ * The task must be affinitized to a single nohz_full core or we will > >>+ * return EINVAL. Although the application could later re-affinitize > >>+ * to a housekeeping core and lose task isolation semantics, this > >>+ * initial test should catch 99% of bugs with task placement prior to > >>+ * enabling task isolation. > >>+ */ > >>+int task_isolation_set(unsigned int flags) > >>+{ > >>+ if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || > >I think you'll have to make sure the task can not be concurrently reaffined > >to more CPUs. This may involve setting task_isolation_flags under the runqueue > >lock and thus move that tiny part to the scheduler code. And then we must forbid > >changing the affinity while the task has the isolation flag, or deactivate the flag. > > > >In any case this needs some synchronization. > > Well, as the comment says, this is not intended as a hard guarantee. > As written, it might race with a concurrent sched_setaffinity(), but > then again, it also is totally OK as written for sched_setaffinity() to > change it away after the prctl() is complete, so it's not necessary to > do any explicit synchronization. 
> > This harks back again to the whole "polite vs aggressive" issue with > how we envision task isolation. > > The "polite" model basically allows you to set up the conditions for > task isolation to be useful, and then if they are useful, great! What > you're suggesting here is a bit more of the "aggressive" model, where > we actually fail sched_setaffinity() either for any cpumask after > task isolation is set, or perhaps just for resetting it to housekeeping > cores. (Note that we could in principle use PF_NO_SETAFFINITY to > just hard fail all attempts to call sched_setaffinity once we enable > task isolation, so we don't have to add more mechanism on that path.) > > I'm a little reluctant to ever fail sched_setaffinity() based on the > task isolation status with the current "polite" model, since an > unprivileged application can set up for task isolation, and then > presumably no one can override it via sched_setaffinity() from another > task. (I suppose you could do some kind of permissions-based thing > where root can always override it, or some suitable capability, etc., > but I feel like that gets complicated quickly, for little benefit.) > > The alternative you mention is that if the task is re-affinitized, it > loses its task-isolation status, and that also seems like an unfortunate > API, since if you are setting it with prctl(), it's really cleanest just to > only be able to unset it with prctl() as well. > > I think given the current "polite" API, the only question is whether in > fact *no* initial test is the best thing, or if an initial test (as > introduced > in the v8 version) is defensible just as a help for catching an obvious > mistake in setting up your task isolation. I decided the advantage > of catching the mistake were more important than the "API purity" > of being 100% consistent in how we handled the interactions between > affinity and isolation, but I am certainly open to argument on that one. 
> > Meanwhile I think it still feels like the v8 code is the best compromise. So what is the way to deal with a migration for example? When the task wakes up on the non-isolated CPU, it gets warned or killed? > > >>+ /* If the tick is running, request rescheduling; we're not ready. */ > >>+ if (!tick_nohz_tick_stopped()) { > >Note that this function tells whether the tick is in dynticks mode, which means > >the tick currently only run on-demand. But it's not necessarily completely stopped. > > I think in fact this is the semantics we want (and that people requested), > e.g. if the user requests an alarm(), we may still be ticking even though > tick_nohz_tick_stopped() is true, but that test is still the right condition > to use to return to user space, since the user explicitly requested the > alarm. It seems to break the initial purpose. If your task really doesn't want to be disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no other indication than the CPU trying to do its best to delay the next tick. But that next tick could be re-armed every two msecs for example. Worse yet, if the tick has been stopped and finally issues a timer that rearms itself every 1 msec, tick_nohz_tick_stopped() will still be true. Thanks. > > >I think we should rename that function and the field it refers to. > > Sounds like a good idea. > > -- > Chris Metcalf, EZChip Semiconductor > http://www.ezchip.com > ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support 2016-01-28 16:38 ` Frederic Weisbecker @ 2016-02-11 19:58 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2016-02-11 19:58 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 01/28/2016 11:38 AM, Frederic Weisbecker wrote: > On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote: >> On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: >>> On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: >>>> +/* >>>> + * This routine controls whether we can enable task-isolation mode. >>>> + * The task must be affinitized to a single nohz_full core or we will >>>> + * return EINVAL. Although the application could later re-affinitize >>>> + * to a housekeeping core and lose task isolation semantics, this >>>> + * initial test should catch 99% of bugs with task placement prior to >>>> + * enabling task isolation. >>>> + */ >>>> +int task_isolation_set(unsigned int flags) >>>> +{ >>>> + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || >>> I think you'll have to make sure the task can not be concurrently reaffined >>> to more CPUs. This may involve setting task_isolation_flags under the runqueue >>> lock and thus move that tiny part to the scheduler code. And then we must forbid >>> changing the affinity while the task has the isolation flag, or deactivate the flag. >>> >>> In any case this needs some synchronization. >> Well, as the comment says, this is not intended as a hard guarantee. 
>> As written, it might race with a concurrent sched_setaffinity(), but >> then again, it also is totally OK as written for sched_setaffinity() to >> change it away after the prctl() is complete, so it's not necessary to >> do any explicit synchronization. >> >> This harks back again to the whole "polite vs aggressive" issue with >> how we envision task isolation. >> >> The "polite" model basically allows you to set up the conditions for >> task isolation to be useful, and then if they are useful, great! What >> you're suggesting here is a bit more of the "aggressive" model, where >> we actually fail sched_setaffinity() either for any cpumask after >> task isolation is set, or perhaps just for resetting it to housekeeping >> cores. (Note that we could in principle use PF_NO_SETAFFINITY to >> just hard fail all attempts to call sched_setaffinity once we enable >> task isolation, so we don't have to add more mechanism on that path.) >> >> I'm a little reluctant to ever fail sched_setaffinity() based on the >> task isolation status with the current "polite" model, since an >> unprivileged application can set up for task isolation, and then >> presumably no one can override it via sched_setaffinity() from another >> task. (I suppose you could do some kind of permissions-based thing >> where root can always override it, or some suitable capability, etc., >> but I feel like that gets complicated quickly, for little benefit.) >> >> The alternative you mention is that if the task is re-affinitized, it >> loses its task-isolation status, and that also seems like an unfortunate >> API, since if you are setting it with prctl(), it's really cleanest just to >> only be able to unset it with prctl() as well. 
>> >> I think given the current "polite" API, the only question is whether in >> fact *no* initial test is the best thing, or if an initial test (as >> introduced >> in the v8 version) is defensible just as a help for catching an obvious >> mistake in setting up your task isolation. I decided the advantage >> of catching the mistake were more important than the "API purity" >> of being 100% consistent in how we handled the interactions between >> affinity and isolation, but I am certainly open to argument on that one. >> >> Meanwhile I think it still feels like the v8 code is the best compromise. > So what is the way to deal with a migration for example? When the task wakes > up on the non-isolated CPU, it gets warned or killed? Good question! We can only enable task isolation on an isolcpus core, so it must be a manual migration, either externally, or by the program itself calling sched_setaffinity(). So at some level, it's just an application bug. In the current code, if you have enabled STRICT mode task isolation, the process will get killed since it has to go through the kernel to migrate. If not in STRICT mode, then it will hang until it is manually killed since full dynticks will never get turned on once it wakes up on a non-isolated CPU - unless it is then manually migrated back to a proper task-isolation cpu. And, perhaps the intent was to do some cpu offlining and rearrange the task isolation tasks, and therefore that makes sense? So, maybe that semantics is good enough!? I'm not completely sure, but I think I'm willing to claim that for something this much of a corner case, it's probably reasonable. >>>> + /* If the tick is running, request rescheduling; we're not ready. */ >>>> + if (!tick_nohz_tick_stopped()) { >>> Note that this function tells whether the tick is in dynticks mode, which means >>> the tick currently only run on-demand. But it's not necessarily completely stopped. 
>> I think in fact this is the semantics we want (and that people requested), >> e.g. if the user requests an alarm(), we may still be ticking even though >> tick_nohz_tick_stopped() is true, but that test is still the right condition >> to use to return to user space, since the user explicitly requested the >> alarm. > It seems to break the initial purpose. If your task really doesn't want to be > disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no > other indication than the CPU trying to do its best to delay the next tick. But > that next tick could be re-armed every two msecs for example. Worse yet, if the > tick has been stopped and finally issues a timer that rearms itself every 1 msec, > tick_nohz_tick_stopped() will still be true. This is definitely another grey area. Certainly if there's a kernel timer that rearms itself every 1 ms, we're in trouble. (And the existing mechanisms of STRICT mode and task_isolation_debug would help.) But as far as just regular userspace arming a timer via syscall, then if your hardware had a 64-bit down counter for timer interrupts, for example, you might well be able to do something like say "every night at midnight, I can stop driving packets and do system maintenance, so I'd like the kernel to interrupt me". In this case some kind of alarm() would not be incompatible with task isolation. I admit this is kind of an extreme case; and certainly in STRICT mode, as currently written, you'd get a signal if you tried to do this, so you'd have to run with STRICT mode off. However, the reason I specifically decided to do this is community feedback. In http://lkml.kernel.org/r/CALCETrVdZxkEeQd3=V6p_yLYL7T83Y3WfnhfVGi3GwTxF+vPQg@mail.gmail.com, on 9/28/2015, Andy Lutomirski wrote: > Why are we treating alarms as something that should defer entry to > userspace? I think it would be entirely reasonable to set an alarm > for ten minutes, ask for isolation, and then think hard for ten > minutes. > > [...] 
> > ISTM something's suboptimal with the inner workings of all this if > task_isolation_enter needs to sleep to wait for an event that isn't > scheduled for the immediate future (e.g. already queued up as an > interrupt). -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 04/14] task_isolation: add initial support @ 2016-02-11 19:58 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2016-02-11 19:58 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 01/28/2016 11:38 AM, Frederic Weisbecker wrote: > On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote: >> On 10/21/2015 12:12 PM, Frederic Weisbecker wrote: >>> On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote: >>>> +/* >>>> + * This routine controls whether we can enable task-isolation mode. >>>> + * The task must be affinitized to a single nohz_full core or we will >>>> + * return EINVAL. Although the application could later re-affinitize >>>> + * to a housekeeping core and lose task isolation semantics, this >>>> + * initial test should catch 99% of bugs with task placement prior to >>>> + * enabling task isolation. >>>> + */ >>>> +int task_isolation_set(unsigned int flags) >>>> +{ >>>> + if (cpumask_weight(tsk_cpus_allowed(current)) != 1 || >>> I think you'll have to make sure the task can not be concurrently reaffined >>> to more CPUs. This may involve setting task_isolation_flags under the runqueue >>> lock and thus move that tiny part to the scheduler code. And then we must forbid >>> changing the affinity while the task has the isolation flag, or deactivate the flag. >>> >>> In any case this needs some synchronization. >> Well, as the comment says, this is not intended as a hard guarantee. >> As written, it might race with a concurrent sched_setaffinity(), but >> then again, it also is totally OK as written for sched_setaffinity() to >> change it away after the prctl() is complete, so it's not necessary to >> do any explicit synchronization. 
>> >> This harks back again to the whole "polite vs aggressive" issue with >> how we envision task isolation. >> >> The "polite" model basically allows you to set up the conditions for >> task isolation to be useful, and then if they are useful, great! What >> you're suggesting here is a bit more of the "aggressive" model, where >> we actually fail sched_setaffinity() either for any cpumask after >> task isolation is set, or perhaps just for resetting it to housekeeping >> cores. (Note that we could in principle use PF_NO_SETAFFINITY to >> just hard fail all attempts to call sched_setaffinity once we enable >> task isolation, so we don't have to add more mechanism on that path.) >> >> I'm a little reluctant to ever fail sched_setaffinity() based on the >> task isolation status with the current "polite" model, since an >> unprivileged application can set up for task isolation, and then >> presumably no one can override it via sched_setaffinity() from another >> task. (I suppose you could do some kind of permissions-based thing >> where root can always override it, or some suitable capability, etc., >> but I feel like that gets complicated quickly, for little benefit.) >> >> The alternative you mention is that if the task is re-affinitized, it >> loses its task-isolation status, and that also seems like an unfortunate >> API, since if you are setting it with prctl(), it's really cleanest just to >> only be able to unset it with prctl() as well. >> >> I think given the current "polite" API, the only question is whether in >> fact *no* initial test is the best thing, or if an initial test (as >> introduced >> in the v8 version) is defensible just as a help for catching an obvious >> mistake in setting up your task isolation. I decided the advantage >> of catching the mistake was more important than the "API purity" >> of being 100% consistent in how we handled the interactions between >> affinity and isolation, but I am certainly open to argument on that one. 
>> >> Meanwhile I think it still feels like the v8 code is the best compromise. > So what is the way to deal with a migration for example? When the task wakes > up on the non-isolated CPU, it gets warned or killed? Good question! We can only enable task isolation on an isolcpus core, so it must be a manual migration, either externally, or by the program itself calling sched_setaffinity(). So at some level, it's just an application bug. In the current code, if you have enabled STRICT mode task isolation, the process will get killed since it has to go through the kernel to migrate. If not in STRICT mode, then it will hang until it is manually killed since full dynticks will never get turned on once it wakes up on a non-isolated CPU - unless it is then manually migrated back to a proper task-isolation cpu. And, perhaps the intent was to do some cpu offlining and rearrange the task isolation tasks, and therefore that makes sense? So, maybe that semantics is good enough!? I'm not completely sure, but I think I'm willing to claim that for something this much of a corner case, it's probably reasonable. >>>> + /* If the tick is running, request rescheduling; we're not ready. */ >>>> + if (!tick_nohz_tick_stopped()) { >>> Note that this function tells whether the tick is in dynticks mode, which means >>> the tick currently only run on-demand. But it's not necessarily completely stopped. >> I think in fact this is the semantics we want (and that people requested), >> e.g. if the user requests an alarm(), we may still be ticking even though >> tick_nohz_tick_stopped() is true, but that test is still the right condition >> to use to return to user space, since the user explicitly requested the >> alarm. > It seems to break the initial purpose. If your task really doesn't want to be > disturbed, it simply can't arm a timer. tick_nohz_tick_stopped() is really no > other indication than the CPU trying to do its best to delay the next tick. 
But > that next tick could be re-armed every two msecs for example. Worse yet, if the > tick has been stopped and finally issues a timer that rearms itself every 1 msec, > tick_nohz_tick_stopped() will still be true. This is definitely another grey area. Certainly if there's a kernel timer that rearms itself every 1 ms, we're in trouble. (And the existing mechanisms of STRICT mode and task_isolation_debug would help.) But as far as just regular userspace arming a timer via syscall, then if your hardware had a 64-bit down counter for timer interrupts, for example, you might well be able to do something like say "every night at midnight, I can stop driving packets and do system maintenance, so I'd like the kernel to interrupt me". In this case some kind of alarm() would not be incompatible with task isolation. I admit this is kind of an extreme case; and certainly in STRICT mode, as currently written, you'd get a signal if you tried to do this, so you'd have to run with STRICT mode off. However, the reason I specifically decided to do this is community feedback. In http://lkml.kernel.org/r/CALCETrVdZxkEeQd3=V6p_yLYL7T83Y3WfnhfVGi3GwTxF+vPQg@mail.gmail.com, on 9/28/2015, Andy Lutomirski wrote: > Why are we treating alarms as something that should defer entry to > userspace? I think it would be entirely reasonable to set an alarm > for ten minutes, ask for isolation, and then think hard for ten > minutes. > > [...] > > ISTM something's suboptimal with the inner workings of all this if > task_isolation_enter needs to sleep to wait for an event that isn't > scheduled for the immediate future (e.g. already queued up as an > interrupt). -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v8 05/14] task_isolation: support PR_TASK_ISOLATION_STRICT mode 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf With task_isolation mode, the task is in principle guaranteed not to be interrupted by the kernel, but only if it behaves. In particular, if it enters the kernel via system call, page fault, or any of a number of other synchronous traps, it may be unexpectedly exposed to long latencies. Add a simple flag that puts the process into a state where any such kernel entry is fatal; this is defined as happening immediately before the SECCOMP test. To allow the state to be entered and exited, we ignore the prctl() syscall so that we can clear the bit again later, and we ignore exit/exit_group to allow exiting the task without a pointless signal killing you as you try to do so. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/linux/isolation.h | 21 +++++++++++++++++++++ include/uapi/linux/prctl.h | 1 + kernel/isolation.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 64 insertions(+) diff --git a/include/linux/isolation.h b/include/linux/isolation.h index 4bef90024924..dc14057a359c 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -29,10 +29,31 @@ static inline void task_isolation_enter(void) _task_isolation_enter(); } +extern bool task_isolation_syscall(int nr); +extern bool task_isolation_exception(const char *fmt, ...); + +static inline bool task_isolation_strict(void) +{ + return (tick_nohz_full_cpu(smp_processor_id()) && + (current->task_isolation_flags & + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) == + (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)); +} + +#define task_isolation_check_syscall(nr) \ + (task_isolation_strict() && \ + task_isolation_syscall(nr)) + +#define task_isolation_check_exception(fmt, ...) \ + (task_isolation_strict() && \ + task_isolation_exception(fmt, ## __VA_ARGS__)) + #else static inline bool task_isolation_enabled(void) { return false; } static inline bool task_isolation_ready(void) { return true; } static inline void task_isolation_enter(void) { } +static inline bool task_isolation_check_syscall(int nr) { return false; } +static inline bool task_isolation_check_exception(const char *fmt, ...) 
{ return false; } #endif #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 67224df4b559..2b8038b0d1e1 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -201,5 +201,6 @@ struct prctl_mm_map { #define PR_SET_TASK_ISOLATION 48 #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_STRICT (1 << 1) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index 9a73235db0bb..30db40098a35 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -11,6 +11,7 @@ #include <linux/vmstat.h> #include <linux/isolation.h> #include <linux/syscalls.h> +#include <asm/unistd.h> #include "time/tick-sched.h" /* @@ -76,3 +77,44 @@ void _task_isolation_enter(void) /* Quieten the vmstat worker so it won't interrupt us. */ quiet_vmstat(); } + +/* + * This routine is called from any userspace exception if the _STRICT + * flag is set. + */ +bool task_isolation_exception(const char *fmt, ...) +{ + va_list args; + char buf[100]; + + /* RCU should have been enabled prior to this point. */ + RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); + + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + + pr_warn("%s/%d: task_isolation strict mode violated by %s\n", + current->comm, current->pid, buf); + current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; + send_sig(SIGKILL, current, 1); + + return true; +} + +/* + * This routine is called from syscall entry (with the syscall number + * passed in) if the _STRICT flag is set. + */ +bool task_isolation_syscall(int syscall) +{ + /* Ignore prctl() syscalls or any task exit. */ + switch (syscall) { + case __NR_prctl: + case __NR_exit: + case __NR_exit_group: + return false; + } + + return task_isolation_exception("syscall %d", syscall); +} -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Cc: Chris Metcalf Allow userspace to override the default SIGKILL delivered when a task_isolation process in STRICT mode does a syscall or otherwise synchronously enters the kernel. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- include/uapi/linux/prctl.h | 2 ++ kernel/isolation.c | 9 ++++++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 2b8038b0d1e1..a5582ace987f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -202,5 +202,7 @@ struct prctl_mm_map { #define PR_GET_TASK_ISOLATION 49 # define PR_TASK_ISOLATION_ENABLE (1 << 0) # define PR_TASK_ISOLATION_STRICT (1 << 1) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/isolation.c b/kernel/isolation.c index 30db40098a35..0fa13b081bb4 100644 --- a/kernel/isolation.c +++ b/kernel/isolation.c @@ -84,8 +84,10 @@ void _task_isolation_enter(void) */ bool task_isolation_exception(const char *fmt, ...) { + siginfo_t info = {}; va_list args; char buf[100]; + int sig; /* RCU should have been enabled prior to this point. */ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU"); @@ -97,7 +99,12 @@ bool task_isolation_exception(const char *fmt, ...) 
pr_warn("%s/%d: task_isolation strict mode violated by %s\n", current->comm, current->pid, buf); current->task_isolation_flags &= ~PR_TASK_ISOLATION_ENABLE; - send_sig(SIGKILL, current, 1); + + sig = PR_TASK_ISOLATION_GET_SIG(current->task_isolation_flags); + if (sig == 0) + sig = SIGKILL; + info.si_signo = sig; + send_sig_info(sig, &info, current); return true; } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-20 20:36 ` Chris Metcalf @ 2015-10-21 0:56 ` Steven Rostedt -1 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-10-21 0:56 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, 20 Oct 2015 16:36:04 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > Allow userspace to override the default SIGKILL delivered > when a task_isolation process in STRICT mode does a syscall > or otherwise synchronously enters the kernel. > Is this really a good idea? This means that there's no way to terminate a task in this mode, even if it goes astray. -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal @ 2015-10-21 1:30 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-21 1:30 UTC (permalink / raw) To: Steven Rostedt Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/20/2015 8:56 PM, Steven Rostedt wrote: > On Tue, 20 Oct 2015 16:36:04 -0400 > Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> Allow userspace to override the default SIGKILL delivered >> when a task_isolation process in STRICT mode does a syscall >> or otherwise synchronously enters the kernel. >> > Is this really a good idea? This means that there's no way to terminate > a task in this mode, even if it goes astray. It doesn't map SIGKILL to some other signal unconditionally. It just allows the "hey, you broke the STRICT contract and entered the kernel" signal to be something besides the default SIGKILL. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 1:30 ` Chris Metcalf @ 2015-10-21 1:41 ` Steven Rostedt -1 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-10-21 1:41 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Tue, 20 Oct 2015 21:30:36 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/20/2015 8:56 PM, Steven Rostedt wrote: > > On Tue, 20 Oct 2015 16:36:04 -0400 > > Chris Metcalf <cmetcalf@ezchip.com> wrote: > > > >> Allow userspace to override the default SIGKILL delivered > >> when a task_isolation process in STRICT mode does a syscall > >> or otherwise synchronously enters the kernel. > >> > > Is this really a good idea? This means that there's no way to terminate > > a task in this mode, even if it goes astray. > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > the "hey, you broke the STRICT contract and entered the kernel" signal > to be something besides the default SIGKILL. > Ah, I misread the change log. Now looking at the actual code, it makes sense. Sorry for the noise ;-) -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 1:30 ` Chris Metcalf (?) (?) @ 2015-10-21 1:42 ` Andy Lutomirski 2015-10-21 6:41 ` Gilad Ben Yossef -1 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-10-21 1:42 UTC (permalink / raw) To: Chris Metcalf Cc: Steven Rostedt, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/20/2015 8:56 PM, Steven Rostedt wrote: >> >> On Tue, 20 Oct 2015 16:36:04 -0400 >> Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >>> Allow userspace to override the default SIGKILL delivered >>> when a task_isolation process in STRICT mode does a syscall >>> or otherwise synchronously enters the kernel. >>> >> Is this really a good idea? This means that there's no way to terminate >> a task in this mode, even if it goes astray. > > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > the "hey, you broke the STRICT contract and entered the kernel" signal > to be something besides the default SIGKILL. > ...which has the odd side effect that sending a non-fatal signal from another process will cause the strict process to enter the kernel and receive an extra signal. I still dislike this thing. It seems like a debugging feature being implemented using signals instead of existing APIs. I *still* don't see why perf can't be used to accomplish your goal. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal @ 2015-10-21 6:41 ` Gilad Ben Yossef 0 siblings, 0 replies; 340+ messages in thread From: Gilad Ben Yossef @ 2015-10-21 6:41 UTC (permalink / raw) To: Andy Lutomirski, Chris Metcalf Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel > From: Andy Lutomirski [mailto:luto@amacapital.net] > Sent: Wednesday, October 21, 2015 4:43 AM > To: Chris Metcalf > Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode > configurable signal > > On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> > wrote: > > On 10/20/2015 8:56 PM, Steven Rostedt wrote: > >> > >> On Tue, 20 Oct 2015 16:36:04 -0400 > >> Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> > >>> Allow userspace to override the default SIGKILL delivered > >>> when a task_isolation process in STRICT mode does a syscall > >>> or otherwise synchronously enters the kernel. > >>> <snip> > > > > It doesn't map SIGKILL to some other signal unconditionally. It just allows > > the "hey, you broke the STRICT contract and entered the kernel" signal > > to be something besides the default SIGKILL. > > > <snip> > > I still dislike this thing. It seems like a debugging feature being > implemented using signals instead of existing APIs. I *still* don't > see why perf can't be used to accomplish your goal. > It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late. Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. 
The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter auto stops. The code has a strict assumption that no more than X cycles pass between reading the value and the response, and the system is built in such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (an unplanned context switch to the kernel), what you want to do is just stop in place rather than fire the alpha emitter X nanoseconds too late. This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. For code where isolation is important, the correctness of a calculation is dependent on timing. It's like how you would accept the kernel killing a task that read from an unmapped virtual address rather than returning garbage data. With an isolated task, the right data acted on later than you think is garbage just the same. I hope this sheds some light on the issue. Thanks, Gilad ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 6:41 ` Gilad Ben Yossef (?) @ 2015-10-21 18:53 ` Andy Lutomirski 2015-10-22 20:44 ` Chris Metcalf 2015-10-24 9:16 ` Gilad Ben Yossef -1 siblings, 2 replies; 340+ messages in thread From: Andy Lutomirski @ 2015-10-21 18:53 UTC (permalink / raw) To: Gilad Ben Yossef Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote: > > >> From: Andy Lutomirski [mailto:luto@amacapital.net] >> Sent: Wednesday, October 21, 2015 4:43 AM >> To: Chris Metcalf >> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >> configurable signal >> >> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >> wrote: >> > On 10/20/2015 8:56 PM, Steven Rostedt wrote: >> >> >> >> On Tue, 20 Oct 2015 16:36:04 -0400 >> >> Chris Metcalf <cmetcalf@ezchip.com> wrote: >> >> >> >>> Allow userspace to override the default SIGKILL delivered >> >>> when a task_isolation process in STRICT mode does a syscall >> >>> or otherwise synchronously enters the kernel. >> >>> > <snip> >> > >> > It doesn't map SIGKILL to some other signal unconditionally. It just allows >> > the "hey, you broke the STRICT contract and entered the kernel" signal >> > to be something besides the default SIGKILL. >> > >> > > <snip> >> >> I still dislike this thing. It seems like a debugging feature being >> implemented using signals instead of existing APIs. I *still* don't >> see why perf can't be used to accomplish your goal. >> > > It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late.
> > Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. > The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an > MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter > auto stops. > > The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in > such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned > context switch to kernel), what you want to do is just stop in place > rather than fire the alpha emitter X nanoseconds too late. > > This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. That's a fair point. It's risky, though, for quite a few reasons. 1. If someone builds an alpha emitter like this, they did it wrong. The kernel should write a trigger *and* a timestamp to the hardware and the hardware should trigger at the specified time if the time is in the future and throw an error if it's in the past. If you need to check that you made the deadline, check the actual desired condition (did you meet the deadline?) not a proxy (did the signal fire?). 2. This strict mode thing isn't exhaustive. It's missing, at least, coverage for NMI, MCE, and SMI. Sure, you can think that you've disabled all NMI sources, you can try to remember to set the appropriate boot flag that panics on MCE (and hope that you don't get screwed by broadcast MCE on Intel systems before it got fixed (Skylake? Is the fix even available in a released chip?)), and, for SMI, good luck... 3. You haven't dealt with IPIs. The TLB flush code in particular seems like it will break all your assumptions.
Maybe it would make sense to whack more of the moles before adding a big assertion that there aren't any moles any more. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-21 18:53 ` Andy Lutomirski @ 2015-10-22 20:44 ` Chris Metcalf 2015-10-22 21:00 ` Andy Lutomirski 2015-10-24 9:16 ` Gilad Ben Yossef 1 sibling, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-10-22 20:44 UTC (permalink / raw) To: Andy Lutomirski, Gilad Ben Yossef Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 10/21/2015 02:53 PM, Andy Lutomirski wrote: > On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> wrote: >> >>> From: Andy Lutomirski [mailto:luto@amacapital.net] >>> Sent: Wednesday, October 21, 2015 4:43 AM >>> To: Chris Metcalf >>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>> configurable signal >>> >>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >>> wrote: >>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>>> >>>>>> Allow userspace to override the default SIGKILL delivered >>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>> or otherwise synchronously enters the kernel. >>>>>> >> <snip> >>>> It doesn't map SIGKILL to some other signal unconditionally. It just allows >>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>> to be something besides the default SIGKILL. >>>> >> <snip> >>> I still dislike this thing. It seems like a debugging feature being >>> implemented using signals instead of existing APIs. I *still* don't >>> see why perf can't be used to accomplish your goal. >>> >> It is not (just) a debugging feature. There are workloads where not performing an action is much preferred to being late.
>> >> Consider the following artificial but representative scenario: a task running in strict isolation is controlling a radiotherapy alpha emitter. >> The code runs in a tight event loop, reading an MMIO register with location data, making some calculation and in response writing an >> MMIO register that triggers the alpha emitter. As a safety measure, each trigger is for a specific very short time frame - the alpha emitter >> auto stops. >> >> The code has a strict assumption that no more than X cycles pass between reading the value and the response and the system is built in >> such a way that as long as the code has mastery of the CPU the assumption holds true. If something breaks this assumption (unplanned >> context switch to kernel), what you want to do is just stop in place >> rather than fire the alpha emitter X nanoseconds too late. >> >> This feature lets you say: if the "contract" of isolation is broken, notify/kill me at once. > That's a fair point. It's risky, though, for quite a few reasons. > > 1. If someone builds an alpha emitter like this, they did it wrong. > The kernel should write a trigger *and* a timestamp to the hardware > and the hardware should trigger at the specified time if the time is > in the future and throw an error if it's in the past. If you need to > check that you made the deadline, check the actual desired condition > (did you meet the deadline?) not a proxy (did the signal fire?). Definitely a better hardware design, but as we all know, hardware designers too rarely consult the software people who have to write the actual code to properly drive the hardware :-) My canonical example is high-performance userspace network drivers, and though dropping a packet is less likely to kill a patient, it's still a pretty bad thing if you're trying to design a robust appliance. In this case you really want to fix application bugs that cause the code to enter the kernel when you think you're in the internal loop running purely in userspace.
Things like unexpected page faults, and third-party code that almost never calls the kernel but in some dusty corner it occasionally does, can screw up your userspace code pretty badly, and mysteriously. The "strict" mode support is not a hypothetical insurance policy but a reaction to lots of Tilera customer support over the years to folks failing to stay in userspace when they thought they were doing the right thing. > 2. This strict mode thing isn't exhaustive. It's missing, at least, > coverage for NMI, MCE, and SMI. Sure, you can think that you've > disabled all NMI sources, you can try to remember to set the > appropriate boot flag that panics on MCE (and hope that you don't get > screwed by broadcast MCE on Intel systems before it got fixed > (Skylake? Is the fix even available in a released chip?), and, for > SMI, good luck... You are confusing this strict mode support with the debug support in patch 07/14. Strict mode is for synchronous application errors. You might be right that there are cases that haven't been covered, but certainly most of them are covered on the three platforms that are supported in this initial series. (You pointed me to one that I would have missed on x86, namely the bounds check exception from a bad bounds setup.) I'm pretty confident I have all of them for tile, since I know that hardware best, and I think we're in good shape for arm64, though I'm still coming up to speed on that architecture. NMIs and machine checks are asynchronous interrupts that don't have to do with what the application is doing, more or less. Those should not be delivered to task-isolation cores at all, so we just generate console spew when you set the task_isolation_debug boot option. I honestly don't know enough about system management interrupts to comment on that, though again, I would hope one can configure the system to just not deliver them to nohz_full cores, and I think it would be reasonable to generate some kernel spew if that happens. > 3. 
You haven't dealt with IPIs. The TLB flush code in particular > seems like it will break all your assumptions. Again, not a synchronous application error that we are trying to catch with this signalling mechanism. That said it could obviously be a more general application error (e.g. a process with threads on both nohz_full and housekeeping cores, where the housekeeping core unmaps some memory and thus requires a TLB flush IPI). But this is covered by the task_isolation_debug patch for kernel/smp.c. > Maybe it would make sense to whack more of the moles before adding a > big assertion that there aren't any moles any more. Maybe, but I've whacked the ones I know how to whack. If there are ones I've missed I'm happy to add them in a subsequent version of this series, or in follow-on patches. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal 2015-10-22 20:44 ` Chris Metcalf @ 2015-10-22 21:00 ` Andy Lutomirski 2015-10-27 19:37 ` Chris Metcalf 0 siblings, 1 reply; 340+ messages in thread From: Andy Lutomirski @ 2015-10-22 21:00 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: > On 10/21/2015 02:53 PM, Andy Lutomirski wrote: >> >> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> >> wrote: >>> >>> >>>> From: Andy Lutomirski [mailto:luto@amacapital.net] >>>> Sent: Wednesday, October 21, 2015 4:43 AM >>>> To: Chris Metcalf >>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>>> configurable signal >>>> >>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >>>> wrote: >>>>> >>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>>> >>>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>>>> >>>>>>> Allow userspace to override the default SIGKILL delivered >>>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>>> or otherwise synchronously enters the kernel. >>>>>>> >>> <snip> >>>>> >>>>> It doesn't map SIGKILL to some other signal unconditionally. It just >>>>> allows >>>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>>> to be something besides the default SIGKILL. >>>>> >>> <snip> >>>> >>>> I still dislike this thing. It seems like a debugging feature being >>>> implemented using signals instead of existing APIs. I *still* don't >>>> see why perf can't be used to accomplish your goal. >>>> >>> It is not (just) a debugging feature. 
There are workloads where not >>> performing an action is much preferred to being late. >>> >>> Consider the following artificial but representative scenario: a task >>> running in strict isolation is controlling a radiotherapy alpha emitter. >>> The code runs in a tight event loop, reading an MMIO register with >>> location data, making some calculation and in response writing an >>> MMIO register that triggers the alpha emitter. As a safety measure, each >>> trigger is for a specific very short time frame - the alpha emitter >>> auto stops. >>> >>> The code has a strict assumption that no more than X cycles pass between >>> reading the value and the response and the system is built in >>> such a way that as long as the code has mastery of the CPU the assumption >>> holds true. If something breaks this assumption (unplanned >>> context switch to kernel), what you want to do is just stop in place >>> rather than fire the alpha emitter X nanoseconds too late. >>> >>> This feature lets you say: if the "contract" of isolation is broken, >>> notify/kill me at once. >> That's a fair point. It's risky, though, for quite a few reasons. >> >> 1. If someone builds an alpha emitter like this, they did it wrong. >> The kernel should write a trigger *and* a timestamp to the hardware >> and the hardware should trigger at the specified time if the time is >> in the future and throw an error if it's in the past. If you need to >> check that you made the deadline, check the actual desired condition >> (did you meet the deadline?) not a proxy (did the signal fire?). > > > Definitely a better hardware design, but as we all know, hardware > designers too rarely consult the software people who have to > write the actual code to properly drive the hardware :-) > > My canonical example is high-performance userspace network > drivers, and though dropping a packet is less likely to kill a > patient, it's still a pretty bad thing if you're trying to design > a robust appliance.
In this case you really want to fix application > bugs that cause the code to enter the kernel when you think > you're in the internal loop running purely in userspace. Things > like unexpected page faults, and third-party code that almost > never calls the kernel but in some dusty corner it occasionally > does, can screw up your userspace code pretty badly, and > mysteriously. The "strict" mode support is not a hypothetical > insurance policy but a reaction to lots of Tilera customer support > over the years to folks failing to stay in userspace when they > thought they were doing the right thing. But this is *exactly* the case where perf or other out-of-band debugging could be a much better solution. Perf could notify a non-isolated thread that an interrupt happened, you'd still drop a packet or two, but you wouldn't also drop the next ten thousand packets while handling the signal. > >> 2. This strict mode thing isn't exhaustive. It's missing, at least, >> coverage for NMI, MCE, and SMI. Sure, you can think that you've >> disabled all NMI sources, you can try to remember to set the >> appropriate boot flag that panics on MCE (and hope that you don't get >> screwed by broadcast MCE on Intel systems before it got fixed >> (Skylake? Is the fix even available in a released chip?), and, for >> SMI, good luck... > > > You are confusing this strict mode support with the debug > support in patch 07/14. Nope. I'm confusing this strict mode with what Gilad described: using strict mode to cause outright shutdown instead of failure to meet a deadline. (FWIW, you could also use an ordinary hardware watchdog timer to promote your failure to meet a deadline to a shutdown. No new kernel support needed.) > > Strict mode is for synchronous application errors. You might > be right that there are cases that haven't been covered, but > certainly most of them are covered on the three platforms that > are supported in this initial series. 
(You pointed me to one > that I would have missed on x86, namely the bounds check > exception from a bad bounds setup.) I'm pretty confident I > have all of them for tile, since I know that hardware best, > and I think we're in good shape for arm64, though I'm still > coming up to speed on that architecture. Again, for this definition of strict mode, I still don't see why it's the right design. If you want to debug your application to detect application errors, use a debugging interface. > > NMIs and machine checks are asynchronous interrupts that > don't have to do with what the application is doing, more or less. > Those should not be delivered to task-isolation cores at all, > so we just generate console spew when you set the > task_isolation_debug boot option. I honestly don't know enough > about system management interrupts to comment on that, > though again, I would hope one can configure the system to > just not deliver them to nohz_full cores, and I think it would > be reasonable to generate some kernel spew if that happens. Hah hah yeah right. On most existing Intel CPUs, you *cannot* configure machine checks to do anything other than broadcast to all cores or cause immediate shutdown. And getting any sort of reasonable control over SMI more or less requires special firmware. > >> 3. You haven't dealt with IPIs. The TLB flush code in particular >> seems like it will break all your assumptions. > > > Again, not a synchronous application error that we are trying > to catch with this signalling mechanism. > > That said it could obviously be a more general application error > (e.g. a process with threads on both nohz_full and housekeeping > cores, where the housekeeping core unmaps some memory and > thus requires a TLB flush IPI). But this is covered by the > task_isolation_debug patch for kernel/smp.c. > >> Maybe it would make sense to whack more of the moles before adding a >> big assertion that there aren't any moles any more. 
> > > Maybe, but I've whacked the ones I know how to whack. > If there are ones I've missed I'm happy to add them in a > subsequent version of this series, or in follow-on patches. > I agree that you can, in principle, catch all the synchronous application errors using this mechanism. I'm saying that catching them seems quite useful, but catching them using a prctl that causes a signal and explicitly does *not* solve the deadline enforcement problem seems to have dubious value in the upstream kernel. You can't catch the asynchronous application errors with this mechanism (or at least your ability to catch them depends on which patch version IIRC), which include calling anything like munmap or membarrier in another thread. --Andy ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal @ 2015-10-27 19:37 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-27 19:37 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel On 10/22/2015 05:00 PM, Andy Lutomirski wrote: > On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote: >> On 10/21/2015 02:53 PM, Andy Lutomirski wrote: >>> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb@ezchip.com> >>> wrote: >>>> >>>>> From: Andy Lutomirski [mailto:luto@amacapital.net] >>>>> Sent: Wednesday, October 21, 2015 4:43 AM >>>>> To: Chris Metcalf >>>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>>>> configurable signal >>>>> >>>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf@ezchip.com> >>>>> wrote: >>>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>>>> Chris Metcalf <cmetcalf@ezchip.com> wrote: >>>>>>> >>>>>>>> Allow userspace to override the default SIGKILL delivered >>>>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>>>> or otherwise synchronously enters the kernel. >>>>>>>> >>>> <snip> >>>>>> It doesn't map SIGKILL to some other signal unconditionally. It just >>>>>> allows >>>>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>>>> to be something besides the default SIGKILL. >>>>>> >>>> <snip> >>>>> I still dislike this thing. It seems like a debugging feature being >>>>> implemented using signals instead of existing APIs. I *still* don't >>>>> see why perf can't be used to accomplish your goal. >>>>> >>>> It is not (just) a debugging feature. 
There are workloads where not >>>> performing an action is much preferred to being late. >>>> >>>> Consider the following artificial but representative scenario: a task >>>> running in strict isolation is controlling a radiotherapy alpha emitter. >>>> The code runs in a tight event loop, reading an MMIO register with >>>> location data, making some calculation and in response writing an >>>> MMIO register that triggers the alpha emitter. As a safety measure, each >>>> trigger is for a specific very short time frame - the alpha emitter >>>> auto stops. >>>> >>>> The code has a strict assumption that no more than X cycles pass between >>>> reading the value and the response and the system is built in >>>> such a way that as long as the code has mastery of the CPU the assumption >>>> holds true. If something breaks this assumption (unplanned >>>> context switch to kernel), what you want to do is just stop in place >>>> rather than fire the alpha emitter X nanoseconds too late. >>>> >>>> This feature lets you say: if the "contract" of isolation is broken, >>>> notify/kill me at once. >>> That's a fair point. It's risky, though, for quite a few reasons. >>> >>> 1. If someone builds an alpha emitter like this, they did it wrong. >>> The kernel should write a trigger *and* a timestamp to the hardware >>> and the hardware should trigger at the specified time if the time is >>> in the future and throw an error if it's in the past. If you need to >>> check that you made the deadline, check the actual desired condition >>> (did you meet the deadline?) not a proxy (did the signal fire?).
>> >> Definitely a better hardware design, but as we all know, hardware >> designers too rarely consult the software people who have to >> write the actual code to properly drive the hardware :-) >> >> My canonical example is high-performance userspace network >> drivers, and though dropping a packet is less likely to kill a >> patient, it's still a pretty bad thing if you're trying to design >> a robust appliance. In this case you really want to fix application >> bugs that cause the code to enter the kernel when you think >> you're in the internal loop running purely in userspace. Things >> like unexpected page faults, and third-party code that almost >> never calls the kernel but in some dusty corner it occasionally >> does, can screw up your userspace code pretty badly, and >> mysteriously. The "strict" mode support is not a hypothetical >> insurance policy but a reaction to lots of Tilera customer support >> over the years to folks failing to stay in userspace when they >> thought they were doing the right thing. > But this is *exactly* the case where perf or other out-of-band > debugging could be a much better solution. Perf could notify a > non-isolated thread that an interrupt happened, you'd still drop a > packet or two, but you wouldn't also drop the next ten thousand > packets while handling the signal. There's no reason the signal needs to be delivered to one of the nohz_full cores. If you're setting up to catch these signals rather than have them just SIGKILL you, then you want to run a separate thread on a housekeeping core that is doing a sigwait() or equivalent. I'm not sure why using perf to do this is particularly better; I'm most interested in ensuring that it is easy for applications to set this up if they want it, and perf isn't always super-easy to use. That said, maybe it's easier than I think to do that specific thing, and worth considering doing it that way instead.
Is there an easily-explained way to do what you suggest where perf delivers a signal? I assume you have in mind creating a synthetic sampling perf event and using perf_event_open() to get a file descriptor for it, and waiting with poll or SIGIO? (Too bad perf_event_open isn't supported by glibc and we have to use syscall() to even call it.) Seems complex... >>> 2. This strict mode thing isn't exhaustive. It's missing, at least, >>> coverage for NMI, MCE, and SMI. Sure, you can think that you've >>> disabled all NMI sources, you can try to remember to set the >>> appropriate boot flag that panics on MCE (and hope that you don't get >>> screwed by broadcast MCE on Intel systems before it got fixed >>> (Skylake? Is the fix even available in a released chip?), and, for >>> SMI, good luck... >> >> You are confusing this strict mode support with the debug >> support in patch 07/14. > Nope. I'm confusing this strict mode with what Gilad described: using > strict mode to cause outright shutdown instead of failure to meet a > deadline. Yeah, fair point. We certainly could wire up a mode to deliver a signal or whatever for asynchronous interrupts (which I'm claiming are primarily kernel bugs) instead of just synchronous interrupts (which I'm claiming are application bugs). That could be an additional mode bit for prctl(), e.g. PR_TASK_ISOLATION_DEBUG to align with the task_isolation_debug boot variable that enables the kernel printk spew. > (FWIW, you could also use an ordinary hardware watchdog timer to > promote your failure to meet a deadline to a shutdown. No new kernel > support needed.) But more hardware support is needed; there may not be a handy hardware watchdog timer to use out of the box, and you don't want to require the customer to buy new hardware to support a feature like this if you don't have to. >> Strict mode is for synchronous application errors. 
You might >> be right that there are cases that haven't been covered, but >> certainly most of them are covered on the three platforms that >> are supported in this initial series. (You pointed me to one >> that I would have missed on x86, namely the bounds check >> exception from a bad bounds setup.) I'm pretty confident I >> have all of them for tile, since I know that hardware best, >> and I think we're in good shape for arm64, though I'm still >> coming up to speed on that architecture. > Again, for this definition of strict mode, I still don't see why it's > the right design. If you want to debug your application to detect > application errors, use a debugging interface. Maybe. But we basically want a single notification that the app (and/or maybe kernel) screwed up. Invoking all of perf for that seems like overkill and a signal seems totally adequate, whether for development fixing bugs, or production catching bad things. There are a reasonable number of precedents for doing things this way: SIGPIPE and SIGFPE, to name two. >> NMIs and machine checks are asynchronous interrupts that >> don't have to do with what the application is doing, more or less. >> Those should not be delivered to task-isolation cores at all, >> so we just generate console spew when you set the >> task_isolation_debug boot option. I honestly don't know enough >> about system management interrupts to comment on that, >> though again, I would hope one can configure the system to >> just not deliver them to nohz_full cores, and I think it would >> be reasonable to generate some kernel spew if that happens. > Hah hah yeah right. On most existing Intel CPUs, you *cannot* > configure machine checks to do anything other than broadcast to all > cores or cause immediate shutdown. And getting any sort of reasonable > control over SMI more or less requires special firmware. 
Yeah, as Gilad said, x86 may not be the best choice to run a task-isolated application unless you can really set up those kinds of things to stay off your core. >>> 3. You haven't dealt with IPIs. The TLB flush code in particular >>> seems like it will break all your assumptions. >> >> Again, not a synchronous application error that we are trying >> to catch with this signalling mechanism. >> >> That said it could obviously be a more general application error >> (e.g. a process with threads on both nohz_full and housekeeping >> cores, where the housekeeping core unmaps some memory and >> thus requires a TLB flush IPI). But this is covered by the >> task_isolation_debug patch for kernel/smp.c. >> >>> Maybe it would make sense to whack more of the moles before adding a >>> big assertion that there aren't any moles any more. >> >> Maybe, but I've whacked the ones I know how to whack. >> If there are ones I've missed I'm happy to add them in a >> subsequent version of this series, or in follow-on patches. >> > I agree that you can, in principle, catch all the synchronous > application errors using this mechanism. I'm saying that catching > them seems quite useful, but catching them using a prctl that causes a > signal and explicitly does *not* solve the deadline enforcement > problem seems to have dubious value in the upstream kernel. When you say "does not solve the deadline enforcement problem", I'm not sure what point you're making. The application presumably can meet its own deadlines when it's not interrupted; the intent here is to notice when the kernel gets in its way and notify it. Granted you could add separate mechanisms to create deadlines within the application, but that feels like a separate layer that may or may not be desired for any given application. 
> You can't catch the asynchronous application errors with this > mechanism (or at least your ability to catch them depends on which > patch version IIRC), which include calling anything like munmap or > membarrier in another thread. Yes, and munmap in another thread is certainly an application bug at some level, so that's another reason to allow using the same mechanism to notify the application of an asynchronous interrupt. I'll add that for the next version of the patch series. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal @ 2015-10-27 19:37 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-27 19:37 UTC (permalink / raw) To: Andy Lutomirski Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc-u79uwXL29TY76Z2rM5mHXA, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 10/22/2015 05:00 PM, Andy Lutomirski wrote: > On Thu, Oct 22, 2015 at 1:44 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >> On 10/21/2015 02:53 PM, Andy Lutomirski wrote: >>> On Tue, Oct 20, 2015 at 11:41 PM, Gilad Ben Yossef <giladb-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >>> wrote: >>>> >>>>> From: Andy Lutomirski [mailto:luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org] >>>>> Sent: Wednesday, October 21, 2015 4:43 AM >>>>> To: Chris Metcalf >>>>> Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode >>>>> configurable signal >>>>> >>>>> On Tue, Oct 20, 2015 at 6:30 PM, Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> >>>>> wrote: >>>>>> On 10/20/2015 8:56 PM, Steven Rostedt wrote: >>>>>>> On Tue, 20 Oct 2015 16:36:04 -0400 >>>>>>> Chris Metcalf <cmetcalf-d5a29ZRxExrQT0dZR+AlfA@public.gmane.org> wrote: >>>>>>> >>>>>>>> Allow userspace to override the default SIGKILL delivered >>>>>>>> when a task_isolation process in STRICT mode does a syscall >>>>>>>> or otherwise synchronously enters the kernel. >>>>>>>> >>>> <snip> >>>>>> It doesn't map SIGKILL to some other signal unconditionally. It just >>>>>> allows >>>>>> the "hey, you broke the STRICT contract and entered the kernel" signal >>>>>> to be something besides the default SIGKILL. >>>>>> >>>> <snip> >>>>> I still dislike this thing. 
It seems like a debugging feature being >>>>> implemented using signals instead of existing APIs. I *still* don't >>>>> see why perf can't be used to accomplish your goal. >>>>> >>>> It is not (just) a debugging feature. There are workloads where not >>>> performing an action is much preferred to being late. >>>> >>>> Consider the following artificial but representative scenario: a task >>>> running in strict isolation is controlling a radiotherapy alpha emitter. >>>> The code runs in a tight event loop, reading an MMIO register with >>>> location data, making some calculation and in response writing an >>>> MMIO register that triggers the alpha emitter. As a safety measure, each >>>> trigger is for a specific very short time frame - the alpha emitter >>>> auto stops. >>>> >>>> The code has a strict assumption that no more than X cycles pass between >>>> reading the value and the response and the system is built in >>>> such a way that as long as the code has mastery of the CPU the assumption >>>> holds true. If something breaks this assumption (unplanned >>>> context switch to kernel), what you want to do is just stop in place >>>> rather than fire the alpha emitter X nanoseconds too late. >>>> >>>> This feature lets you say: if the "contract" of isolation is broken, >>>> notify/kill me at once. >>> That's a fair point. It's risky, though, for quite a few reasons. >>> >>> 1. If someone builds an alpha emitter like this, they did it wrong. >>> The kernel should write a trigger *and* a timestamp to the hardware >>> and the hardware should trigger at the specified time if the time is >>> in the future and throw an error if it's in the past. If you need to >>> check that you made the deadline, check the actual desired condition >>> (did you meet the deadline?) not a proxy (did the signal fire?). 
>> >> Definitely a better hardware design, but as we all know, hardware >> designers too rarely consult the software people who have to >> write the actual code to properly drive the hardware :-) >> >> My canonical example is high-performance userspace network >> drivers, and though dropping a packet is less likely to kill a >> patient, it's still a pretty bad thing if you're trying to design >> a robust appliance. In this case you really want to fix application >> bugs that cause the code to enter the kernel when you think >> you're in the internal loop running purely in userspace. Things >> like unexpected page faults, and third-party code that almost >> never calls the kernel but in some dusty corner it occasionally >> does, can screw up your userspace code pretty badly, and >> mysteriously. The "strict" mode support is not a hypothetical >> insurance policy but a reaction to lots of Tilera customer support >> over the years to folks failing to stay in userspace when they >> thought they were doing the right thing. > But this is *exactly* the case where perf or other out-of-band > debugging could be a much better solution. Perf could notify a > non-isolated thread that an interrupt happened, you'd still drop a > packet or two, but you wouldn't also drop the next ten thousand > packets while handling the signal. There's no reason the signal needs to be delivered to one of the nohz_full cores. If you're setting up to catch these signals rather than have them just SIGKILL you, then you want to run a separate thread on a housekeeping core that is doing a sigwait() or equivalent. I'm not sure why using perf to do this is particularly better; I'm most interested in ensuring that it is easy for applications to set this up if they want it, and perf isn't always super-easy to use. That said, maybe it's easier than I think to do that specific thing, and worth considering doing it that way instead. 
Is there an easily-explained way to do what you suggest where perf delivers a signal? I assume you have in mind creating a synthetic sampling perf event and using perf_event_open() to get a file descriptor for it, and waiting with poll or SIGIO? (Too bad perf_event_open isn't supported by glibc and we have to use syscall() to even call it.) Seems complex... >>> 2. This strict mode thing isn't exhaustive. It's missing, at least, >>> coverage for NMI, MCE, and SMI. Sure, you can think that you've >>> disabled all NMI sources, you can try to remember to set the >>> appropriate boot flag that panics on MCE (and hope that you don't get >>> screwed by broadcast MCE on Intel systems before it got fixed >>> (Skylake? Is the fix even available in a released chip?), and, for >>> SMI, good luck... >> >> You are confusing this strict mode support with the debug >> support in patch 07/14. > Nope. I'm confusing this strict mode with what Gilad described: using > strict mode to cause outright shutdown instead of failure to meet a > deadline. Yeah, fair point. We certainly could wire up a mode to deliver a signal or whatever for asynchronous interrupts (which I'm claiming are primarily kernel bugs) instead of just synchronous interrupts (which I'm claiming are application bugs). That could be an additional mode bit for prctl(), e.g. PR_TASK_ISOLATION_DEBUG to align with the task_isolation_debug boot variable that enables the kernel printk spew. > (FWIW, you could also use an ordinary hardware watchdog timer to > promote your failure to meet a deadline to a shutdown. No new kernel > support needed.) But more hardware support is needed; there may not be a handy hardware watchdog timer to use out of the box, and you don't want to require the customer to buy new hardware to support a feature like this if you don't have to. >> Strict mode is for synchronous application errors. 
You might >> be right that there are cases that haven't been covered, but >> certainly most of them are covered on the three platforms that >> are supported in this initial series. (You pointed me to one >> that I would have missed on x86, namely the bounds check >> exception from a bad bounds setup.) I'm pretty confident I >> have all of them for tile, since I know that hardware best, >> and I think we're in good shape for arm64, though I'm still >> coming up to speed on that architecture. > Again, for this definition of strict mode, I still don't see why it's > the right design. If you want to debug your application to detect > application errors, use a debugging interface. Maybe. But we basically want a single notification that the app (and/or maybe kernel) screwed up. Invoking all of perf for that seems like overkill and a signal seems totally adequate, whether for development fixing bugs, or production catching bad things. There are a reasonable number of precedents for doing things this way: SIGPIPE and SIGFPE, to name two. >> NMIs and machine checks are asynchronous interrupts that >> don't have to do with what the application is doing, more or less. >> Those should not be delivered to task-isolation cores at all, >> so we just generate console spew when you set the >> task_isolation_debug boot option. I honestly don't know enough >> about system management interrupts to comment on that, >> though again, I would hope one can configure the system to >> just not deliver them to nohz_full cores, and I think it would >> be reasonable to generate some kernel spew if that happens. > Hah hah yeah right. On most existing Intel CPUs, you *cannot* > configure machine checks to do anything other than broadcast to all > cores or cause immediate shutdown. And getting any sort of reasonable > control over SMI more or less requires special firmware. 
Yeah, as Gilad said, x86 may not be the best choice to run a task-isolated application unless you can really set up those kinds of things to stay off your core. >>> 3. You haven't dealt with IPIs. The TLB flush code in particular >>> seems like it will break all your assumptions. >> >> Again, not a synchronous application error that we are trying >> to catch with this signalling mechanism. >> >> That said it could obviously be a more general application error >> (e.g. a process with threads on both nohz_full and housekeeping >> cores, where the housekeeping core unmaps some memory and >> thus requires a TLB flush IPI). But this is covered by the >> task_isolation_debug patch for kernel/smp.c. >> >>> Maybe it would make sense to whack more of the moles before adding a >>> big assertion that there aren't any moles any more. >> >> Maybe, but I've whacked the ones I know how to whack. >> If there are ones I've missed I'm happy to add them in a >> subsequent version of this series, or in follow-on patches. >> > I agree that you can, in principle, catch all the synchronous > application errors using this mechanism. I'm saying that catching > them seems quite useful, but catching them using a prctl that causes a > signal and explicitly does *not* solve the deadline enforcement > problem seems to have dubious value in the upstream kernel. When you say "does not solve the deadline enforcement problem", I'm not sure what point you're making. The application presumably can meet its own deadlines when it's not interrupted; the intent here is to notice when the kernel gets in its way and notify it. Granted you could add separate mechanisms to create deadlines within the application, but that feels like a separate layer that may or may not be desired for any given application. 
> You can't catch the asynchronous application errors with this > mechanism (or at least your ability to catch them depends on which > patch version IIRC), which include calling anything like munmap or > membarrier in another thread. Yes, and munmap in another thread is certainly an application bug at some level, so that's another reason to allow using the same mechanism to notify the application of an asynchronous interrupt. I'll add that for the next version of the patch series. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* RE: [PATCH v8 06/14] task_isolation: provide strict mode configurable signal @ 2015-10-24 9:16 ` Gilad Ben Yossef 0 siblings, 0 replies; 340+ messages in thread From: Gilad Ben Yossef @ 2015-10-24 9:16 UTC (permalink / raw) To: Andy Lutomirski Cc: Chris Metcalf, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, linux-doc, Linux API, linux-kernel Hi Andy, Thanks for the feedback. > From: Andy Lutomirski [mailto:luto@amacapital.net] > Sent: Wednesday, October 21, 2015 9:53 PM > To: Gilad Ben Yossef > Cc: Chris Metcalf; Steven Rostedt; Ingo Molnar; Peter Zijlstra; Andrew > Morton; Rik van Riel; Tejun Heo; Frederic Weisbecker; Thomas Gleixner; Paul > E. McKenney; Christoph Lameter; Viresh Kumar; Catalin Marinas; Will Deacon; > linux-doc@vger.kernel.org; Linux API; linux-kernel@vger.kernel.org > Subject: Re: [PATCH v8 06/14] task_isolation: provide strict mode > configurable signal > > >> >> On Tue, 20 Oct 2015 16:36:04 -0400 > >> >> Chris Metcalf <cmetcalf@ezchip.com> wrote: > >> >> > >> >>> Allow userspace to override the default SIGKILL delivered > >> >>> when a task_isolation process in STRICT mode does a syscall > >> >>> or otherwise synchronously enters the kernel. > >> >>> > > <snip> > >> > > >> > It doesn't map SIGKILL to some other signal unconditionally. It just allows > >> > the "hey, you broke the STRICT contract and entered the kernel" signal > >> > to be something besides the default SIGKILL. > >> > > >> > > > > <snip> > >> > >> I still dislike this thing. It seems like a debugging feature being > >> implemented using signals instead of existing APIs. I *still* don't > >> see why perf can't be used to accomplish your goal. > >> > > > > It is not (just) a debugging feature. There are workloads where not > performing an action is much preferred to being late. 
> > > > Consider the following artificial but representative scenario: a task running > in strict isolation is controlling a radiotherapy alpha emitter. > > The code runs in a tight event loop, reading an MMIO register with location > data, making some calculation and in response writing an > > MMIO register that triggers the alpha emitter. As a safety measure, each > trigger is for a specific very short time frame - the alpha emitter > > auto stops. > > > > The code has a strict assumption that no more than X cycles pass between > reading the value and the response and the system is built in > > such a way that as long as the code has mastery of the CPU the assumption > holds true. If something breaks this assumption (unplanned > > context switch to kernel), what you want to do is just stop in place > > rather than fire the alpha emitter X nanoseconds too late. > > > > This feature lets you say: if the "contract" of isolation is broken, notify/kill > me at once. > > That's a fair point. It's risky, though, for quite a few reasons. > > 1. If someone builds an alpha emitter like this, they did it wrong. > The kernel should write a trigger *and* a timestamp to the hardware > and the hardware should trigger at the specified time if the time is > in the future and throw an error if it's in the past. If you need to > check that you made the deadline, check the actual desired condition > (did you meet the deadline?) not a proxy (did the signal fire?). > As I wrote above, it is an *artificial* scenario. Yes, hardware and systems can be designed better, but they are not always, and in these kinds of systems, you really do want to have double or triple checks. Knowing such systems, even IF the hardware was designed as you specified (and I agree it should!) you would still add the software protection. > 2. This strict mode thing isn't exhaustive. It's missing, at least, > coverage for NMI, MCE, and SMI. 
Sure, you can think that you've > disabled all NMI sources, you can try to remember to set the > appropriate boot flag that panics on MCE (and hope that you don't get > screwed by broadcast MCE on Intel systems before it got fixed > (Skylake? Is the fix even available in a released chip?), and, for > SMI, good luck... You are right - it isn't exhaustive. It is one piece in a bigger puzzle. Many of the other bits are platform specific and some of them have been dealt with on the platforms that care about these things. Yes, we don't have dark magic to detect SMIs. Is that a reason to penalize platforms where there is no such thing as SMI? > 3. You haven't dealt with IPIs. The TLB flush code in particular > seems like it will break all your assumptions. > But we have - in the general context. Consider this patch set from 2012 - https://lwn.net/Articles/479510/ Not finished for sure. But what we have is now useful enough that it is used in the real world for different workloads on different platforms, from packet processing, through HPC to high frequency trading. > Maybe it would make sense to whack more of the moles before adding a > big assertion that there aren't any moles any more. > hm... maybe you are reading too much into this specific feature - it's a "notify me, the application, if I asked you to do something that violates my previous request to be isolated", rather than "notify me whenever isolation is broken". Does that make more sense? Thanks, Gilad ^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v8 07/14] task_isolation: add debug boot flag 2015-10-20 20:35 ` Chris Metcalf ` (6 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel Cc: Chris Metcalf The new "task_isolation_debug" flag simplifies debugging of TASK_ISOLATION kernels when processes are running in PR_TASK_ISOLATION_ENABLE mode. Such processes should get no interrupts from the kernel, and if they do, when this boot flag is specified a kernel stack dump on the console is generated. It's possible to use ftrace to simply detect whether a task_isolation core has unexpectedly entered the kernel. But what this boot flag does is allow the kernel to provide better diagnostics, e.g. by reporting in the IPI-generating code what remote core and context is preparing to deliver an interrupt to a task_isolation core. It may be worth considering other ways to generate useful debugging output rather than console spew, but for now that is simple and direct. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- Documentation/kernel-parameters.txt | 7 +++++++ include/linux/isolation.h | 2 ++ kernel/irq_work.c | 5 ++++- kernel/sched/core.c | 21 +++++++++++++++++++++ kernel/signal.c | 5 +++++ kernel/smp.c | 4 ++++ kernel/softirq.c | 7 +++++++ 7 files changed, 50 insertions(+), 1 deletion(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 22a4b687ea5b..48ff15f3166f 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3623,6 +3623,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. 
neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation_debug [KNL] + In kernels built with CONFIG_TASK_ISOLATION and booted + in nohz_full= mode, this setting will generate console + backtraces when the kernel is about to interrupt a + task that has requested PR_TASK_ISOLATION_ENABLE + and is running on a nohz_full core. + tcpmhash_entries= [KNL,NET] Set the number of tcp_metrics_hash slots. Default value is 8192 or 16384 depending on total diff --git a/include/linux/isolation.h b/include/linux/isolation.h index dc14057a359c..ad94d1168c31 100644 --- a/include/linux/isolation.h +++ b/include/linux/isolation.h @@ -31,6 +31,7 @@ static inline void task_isolation_enter(void) extern bool task_isolation_syscall(int nr); extern bool task_isolation_exception(const char *fmt, ...); +extern void task_isolation_debug(int cpu); static inline bool task_isolation_strict(void) { @@ -54,6 +55,7 @@ static inline bool task_isolation_ready(void) { return true; } static inline void task_isolation_enter(void) { } static inline bool task_isolation_check_syscall(int nr) { return false; } static inline bool task_isolation_check_exception(const char *fmt, ...) 
{ return false; } +static inline void task_isolation_debug(int cpu) { } #endif #endif diff --git a/kernel/irq_work.c b/kernel/irq_work.c index cbf9fb899d92..745c2ea6a4e4 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -17,6 +17,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/smp.h> +#include <linux/isolation.h> #include <asm/processor.h> @@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) { + task_isolation_debug(cpu); arch_send_call_function_single_ipi(cpu); + } return true; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 10a8faa1b0d4..b79f8e0aeffb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -74,6 +74,7 @@ #include <linux/binfmts.h> #include <linux/context_tracking.h> #include <linux/compiler.h> +#include <linux/isolation.h> #include <asm/switch_to.h> #include <asm/tlb.h> @@ -746,6 +747,26 @@ bool sched_can_stop_tick(void) } #endif /* CONFIG_NO_HZ_FULL */ +#ifdef CONFIG_TASK_ISOLATION +/* Enable debugging of any interrupts of task_isolation cores. 
*/ +static int task_isolation_debug_flag; +static int __init task_isolation_debug_func(char *str) +{ + task_isolation_debug_flag = true; + return 1; +} +__setup("task_isolation_debug", task_isolation_debug_func); + +void task_isolation_debug(int cpu) +{ + if (task_isolation_debug_flag && tick_nohz_full_cpu(cpu) && + (cpu_curr(cpu)->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) { + pr_err("Interrupt detected for task_isolation cpu %d\n", cpu); + dump_stack(); + } +} +#endif + void sched_avg_update(struct rq *rq) { s64 period = sched_avg_period(); diff --git a/kernel/signal.c b/kernel/signal.c index 0f6bbbe77b46..c6e09f0f7e24 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -684,6 +684,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { +#ifdef CONFIG_TASK_ISOLATION + /* If the task is being killed, don't complain about task_isolation. */ + if (state & TASK_WAKEKILL) + t->task_isolation_flags = 0; +#endif set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/smp.c b/kernel/smp.c index 07854477c164..b0bddff2693d 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#include <linux/isolation.h> #include "smpboot.h" @@ -178,6 +179,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... 
*/ + task_isolation_debug(cpu); if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) arch_send_call_function_single_ipi(cpu); @@ -457,6 +459,8 @@ void smp_call_function_many(const struct cpumask *mask, } /* Send a message to all CPUs in the map */ + for_each_cpu(cpu, cfd->cpumask) + task_isolation_debug(cpu); arch_send_call_function_ipi_mask(cfd->cpumask); if (wait) { diff --git a/kernel/softirq.c b/kernel/softirq.c index 479e4436f787..ed762fec7265 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -24,8 +24,10 @@ #include <linux/ftrace.h> #include <linux/smp.h> #include <linux/smpboot.h> +#include <linux/context_tracking.h> #include <linux/tick.h> #include <linux/irq.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @@ -335,6 +337,11 @@ void irq_enter(void) _local_bh_enable(); } + if (context_tracking_cpu_is_enabled() && + context_tracking_in_user() && + !in_interrupt()) + task_isolation_debug(smp_processor_id()); + __irq_enter(); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 20:35 ` Chris Metcalf ` (7 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf 2015-10-20 21:03 ` Frederic Weisbecker -1 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel Cc: Chris Metcalf While the current fallback to 1-second tick is still required for a number of kernel accounting tasks (e.g. vruntime, load balancing data, and load accounting), it's useful to be able to disable it for testing purposes. Paul McKenney observed that if we provide a mode where the 1Hz fallback timer is removed, this will provide an environment where new code that relies on that tick will get punished, and we won't forgive such assumptions silently. This option also allows easy testing of nohz_full and task-isolation modes to determine what functionality needs to be implemented, and what possibly-spurious timer interrupts are scheduled when the basic 1Hz tick has been turned off. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- kernel/sched/core.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b79f8e0aeffb..634d5c2ab08a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2849,6 +2849,19 @@ void scheduler_tick(void) } #ifdef CONFIG_NO_HZ_FULL +/* + * Allow a boot-time option to debug running + * without the 1Hz minimum tick on nohz_full cores. 
+ */ +static bool debug_1hz_tick; + +static __init int set_debug_1hz_tick(char *arg) +{ + debug_1hz_tick = true; + return 1; +} +__setup("debug_1hz_tick", set_debug_1hz_tick); + /** * scheduler_tick_max_deferment * @@ -2867,6 +2880,9 @@ u64 scheduler_tick_max_deferment(void) struct rq *rq = this_rq(); unsigned long next, now = READ_ONCE(jiffies); + if (debug_1hz_tick) + return KTIME_MAX; + next = rq->last_sched_tick + HZ; if (time_before_eq(next, now)) -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 20:36 ` [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot Chris Metcalf @ 2015-10-20 21:03 ` Frederic Weisbecker 2015-10-20 21:18 ` Chris Metcalf ` (2 more replies) 0 siblings, 3 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-20 21:03 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Tue, Oct 20, 2015 at 04:36:06PM -0400, Chris Metcalf wrote: > While the current fallback to 1-second tick is still required for > a number of kernel accounting tasks (e.g. vruntime, load balancing > data, and load accounting), it's useful to be able to disable it > for testing purposes. Paul McKenney observed that if we provide > a mode where the 1Hz fallback timer is removed, this will provide > an environment where new code that relies on that tick will get > punished, and we won't forgive such assumptions silently. > > This option also allows easy testing of nohz_full and task-isolation > modes to determine what functionality needs to be implemented, > and what possibly-spurious timer interrupts are scheduled when > the basic 1Hz tick has been turned off. > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> There have been proposals to disable/tune the 1 Hz tick via debugfs which I Nacked because once you give such an opportunity to the users, they will use that hack and never fix the real underlying issue. For the same reasons, I'm sorry but I have to Nack this proposal as well. If this is for development or testing purpose, scheduler_max_tick_deferment() is easily commented out. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 21:03 ` Frederic Weisbecker @ 2015-10-20 21:18 ` Chris Metcalf 2015-10-21 0:59 ` Steven Rostedt 2015-10-21 6:56 ` Gilad Ben Yossef 2015-10-21 14:28 ` Christoph Lameter 2 siblings, 1 reply; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 21:18 UTC (permalink / raw) To: Frederic Weisbecker Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On 10/20/2015 05:03 PM, Frederic Weisbecker wrote: > On Tue, Oct 20, 2015 at 04:36:06PM -0400, Chris Metcalf wrote: >> While the current fallback to 1-second tick is still required for >> a number of kernel accounting tasks (e.g. vruntime, load balancing >> data, and load accounting), it's useful to be able to disable it >> for testing purposes. Paul McKenney observed that if we provide >> a mode where the 1Hz fallback timer is removed, this will provide >> an environment where new code that relies on that tick will get >> punished, and we won't forgive such assumptions silently. >> >> This option also allows easy testing of nohz_full and task-isolation >> modes to determine what functionality needs to be implemented, >> and what possibly-spurious timer interrupts are scheduled when >> the basic 1Hz tick has been turned off. >> >> Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > There have been proposals to disable/tune the 1 Hz tick via debugfs which > I Nacked because once you give such an opportunity to the users, they > will use that hack and never fix the real underlying issue. > > For the same reasons, I'm sorry but I have to Nack this proposal as well. > > If this is for development or testing purpose, scheduler_max_tick_deferment() is > easily commented out. 
Fair enough and certainly your prerogative, so don't hesitate to say "no" to the following argument. :-) I would tend to differentiate a debugfs proposal from a boot flag proposal: a boot flag is a more hardcore thing to change, and it's not like application developers will come along and explain that you have to boot with different flags to run their app - whereas if they can just sneak in a modification to a debugfs setting that's much easier for the app to tweak. So perhaps a boot flag is an acceptable compromise between "nothing" and a debugfs tweak? It certainly does make it easier to hack on the task-isolation code, and likely other things where people are trying out fixes to subsystems where they are attempting to remove the reliance on the tick. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 21:18 ` Chris Metcalf @ 2015-10-21 0:59 ` Steven Rostedt 0 siblings, 0 replies; 340+ messages in thread From: Steven Rostedt @ 2015-10-21 0:59 UTC (permalink / raw) To: Chris Metcalf Cc: Frederic Weisbecker, Gilad Ben Yossef, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Tue, 20 Oct 2015 17:18:13 -0400 Chris Metcalf <cmetcalf@ezchip.com> wrote: > So perhaps a boot flag is an acceptable compromise between > "nothing" and a debugfs tweak? It certainly does make it easier > to hack on the task-isolation code, and likely other things where > people are trying out fixes to subsystems where they are attempting > to remove the reliance on the tick. > Just change the name to: this_will_crash_your_kernel_and_kill_your_kittens_debug_1hz_tick -- Steve ^ permalink raw reply [flat|nested] 340+ messages in thread
* RE: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 21:03 ` Frederic Weisbecker 2015-10-20 21:18 ` Chris Metcalf @ 2015-10-21 6:56 ` Gilad Ben Yossef 2015-10-21 14:28 ` Christoph Lameter 2 siblings, 0 replies; 340+ messages in thread From: Gilad Ben Yossef @ 2015-10-21 6:56 UTC (permalink / raw) To: Frederic Weisbecker, Chris Metcalf Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel > From: Frederic Weisbecker [mailto:fweisbec@gmail.com] > > This option also allows easy testing of nohz_full and task-isolation > > modes to determine what functionality needs to be implemented, > > and what possibly-spurious timer interrupts are scheduled when > > the basic 1Hz tick has been turned off. > > > > Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> > > There have been proposals to disable/tune the 1 Hz tick via debugfs which > I Nacked because once you give such an opportunity to the users, they > will use that hack and never fix the real underlying issue. > > For the same reasons, I'm sorry but I have to Nack this proposal as well. > > If this is for development or testing purpose, > scheduler_max_tick_deferment() is > easily commented out. The problem with the latter is that it is much easier to get back to one of the poor^H^H^H^H brave souls that are willing to risk their kittens testing this stuff for us saying: "can you please boot without this boot option and let me know if that behavior you were complaining about still happens?" rather than "can you please go to this and that line in the source file and un-comment it and re-compile and see if it still happens?" I hope this makes more sense. Thinking about it, it's probably a good idea to taint the kernel when this option is set as well. 
Thanks, Gilad ^ permalink raw reply [flat|nested] 340+ messages in thread
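Gilad's tainting suggestion could be sketched as below. This is a hedged userspace model: the real kernel mechanism would be a call to `add_taint()` from the `__setup` handler, and the `model_*` names and the taint bit here are illustrative, not kernel APIs.

```c
/* Userspace model of tainting the kernel from the boot-flag handler. */
static unsigned long model_taint_flags;

#define MODEL_TAINT_DEBUG_TICK (1UL << 0)	/* hypothetical taint bit */

/* Mirrors the shape of a __setup() handler: record the taint, then
 * return 1 to indicate the boot argument was consumed. */
static int model_set_debug_1hz_tick(char *arg)
{
	(void)arg;	/* the boot argument takes no value, as in the patch */
	model_taint_flags |= MODEL_TAINT_DEBUG_TICK;
	return 1;
}
```

A tainted kernel would then make it obvious in bug reports that the 1Hz fallback had been disabled.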
* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-20 21:03 ` Frederic Weisbecker 2015-10-20 21:18 ` Chris Metcalf 2015-10-21 6:56 ` Gilad Ben Yossef @ 2015-10-21 14:28 ` Christoph Lameter 2015-10-21 15:35 ` Frederic Weisbecker 2 siblings, 1 reply; 340+ messages in thread From: Christoph Lameter @ 2015-10-21 14:28 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Tue, 20 Oct 2015, Frederic Weisbecker wrote: > There have been proposals to disable/tune the 1 Hz tick via debugfs which > I Nacked because once you give such an opportunity to the users, they > will use that hack and never fix the real underlying issue. Well this is a pretty bad argument. By that reasoning no one should be allowed to use root. After all stupid users will become root and kill processes that are hung. And the underlying issue that causes those hangs in processes will never be fixed because they will keep on killing processes. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 08/14] nohz_full: allow disabling the 1Hz minimum tick at boot 2015-10-21 14:28 ` Christoph Lameter @ 2015-10-21 15:35 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-21 15:35 UTC (permalink / raw) To: Christoph Lameter Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-kernel On Wed, Oct 21, 2015 at 09:28:04AM -0500, Christoph Lameter wrote: > On Tue, 20 Oct 2015, Frederic Weisbecker wrote: > > > There have been proposals to disable/tune the 1 Hz tick via debugfs which > > I Nacked because once you give such an opportunity to the users, they > > will use that hack and never fix the real underlying issue. > > Well this is a pretty bad argument. By that reasoning no one should be > allowed to use root. After all stupid users will become root and kill > processes that are hung. And the underlying issue that causes those hangs > in processes will never be fixed because they will keep on killing > processes. I disagree. There is a significant frontier between: _ hack it up and you're responsible for the consequences yourself and _ provide a buggy hack to the user and support this officially upstream Especially as all I've seen in two years, wrt. solving the 1 Hz issue, is patches like this. Almost nobody really tried to dig into the real issues that are well known and identified by now, and not that hard to fix: it's many standalone issues to make the scheduler resilient to full-total-hard-dynticks. The only effort toward that I've seen lately is: https://lkml.org/lkml/2015/10/14/192 and still I think the author came to that nohz issue by accident. Many users are too easily happy with hacks like this, and that one is too good an opportunity for them. 
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH v8 09/14] arch/x86: enable task isolation functionality 2015-10-20 20:35 ` Chris Metcalf ` (8 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, H. Peter Anvin, x86, linux-kernel Cc: Chris Metcalf In prepare_exit_to_usermode(), call task_isolation_ready() when we are checking the thread-info flags, and after we've handled the other work, call task_isolation_enter() unconditionally. In syscall_trace_enter_phase1(), we add the necessary support for strict-mode detection of syscalls. We add strict reporting for the kernel exception types that do not result in signals, namely non-signalling page faults and non-signalling MPX fixups. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/x86/entry/common.c | 10 +++++++++- arch/x86/kernel/traps.c | 2 ++ arch/x86/mm/fault.c | 2 ++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 80dcc9261ca3..13426c0656b4 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -21,6 +21,7 @@ #include <linux/context_tracking.h> #include <linux/user-return-notifier.h> #include <linux/uprobes.h> +#include <linux/isolation.h> #include <asm/desc.h> #include <asm/traps.h> @@ -81,6 +82,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) */ if (work & _TIF_NOHZ) { enter_from_user_mode(); + if (task_isolation_check_syscall(regs->orig_ax)) { + regs->orig_ax = -1; + return 0; + } work &= ~_TIF_NOHZ; } #endif @@ -234,7 +239,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_NEED_RESCHED | - _TIF_USER_RETURN_NOTIFY))) + _TIF_USER_RETURN_NOTIFY)) && + task_isolation_ready()) break; /* We have work to do. */ @@ -258,6 +264,8 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs) if (cached_flags & _TIF_USER_RETURN_NOTIFY) fire_user_return_notifiers(); + task_isolation_enter(); + /* Disable IRQs and retry */ local_irq_disable(); } diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 346eec73f7db..1ed4d8a52d23 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -36,6 +36,7 @@ #include <linux/mm.h> #include <linux/smp.h> #include <linux/io.h> +#include <linux/isolation.h> #ifdef CONFIG_EISA #include <linux/ioport.h> @@ -398,6 +399,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code) case 2: /* Bound directory has invalid entry. */ if (mpx_handle_bd_fault()) goto exit_trap; + task_isolation_check_exception("bounds check"); break; /* Success, it was handled */ case 1: /* Bound violation. 
*/ info = mpx_generate_siginfo(regs); diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index eef44d9a3f77..7b23487a3bd7 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -14,6 +14,7 @@ #include <linux/prefetch.h> /* prefetchw */ #include <linux/context_tracking.h> /* exception_enter(), ... */ #include <linux/uaccess.h> /* faulthandler_disabled() */ +#include <linux/isolation.h> /* task_isolation_check_exception */ #include <asm/traps.h> /* dotraplinkage, ... */ #include <asm/pgalloc.h> /* pgd_*(), ... */ @@ -1148,6 +1149,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, local_irq_enable(); error_code |= PF_USER; flags |= FAULT_FLAG_USER; + task_isolation_check_exception("page fault at %#lx", address); } else { if (regs->flags & X86_EFLAGS_IF) local_irq_enable(); -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
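The reworked exit path loops until no thread-info work remains and the isolation gate is satisfied. The following userspace model shows that control flow; it is a hedged sketch, and all the `model_*` and `WORK_*` names are illustrative rather than the kernel's.

```c
#include <stdbool.h>

/* Model of the modified prepare_exit_to_usermode() loop: the task keeps
 * looping while ordinary work flags are pending OR the isolation gate
 * reports that the core is not yet quiesced. */

#define WORK_SIGPENDING 0x1
#define WORK_RESCHED    0x2

static int isolation_wait_loops;	/* loops until the core counts as quiesced */

static bool model_isolation_ready(void)
{
	return isolation_wait_loops <= 0;
}

static void model_isolation_enter(void)
{
	if (isolation_wait_loops > 0)
		isolation_wait_loops--;	/* e.g. waiting out a pending timer */
}

/* Returns the number of loop iterations before returning to user space. */
static int model_exit_to_usermode(unsigned int flags)
{
	int loops = 0;

	for (;;) {
		/* Exit only when no work is pending AND isolation is ready. */
		if (!(flags & (WORK_SIGPENDING | WORK_RESCHED)) &&
		    model_isolation_ready())
			break;
		flags = 0;		/* one pass handles all pending work */
		model_isolation_enter();
		loops++;
	}
	return loops;
}
```

Note that `task_isolation_enter()` runs unconditionally in the body, matching the patch: any pass through the loop, for whatever reason, also advances the quiescing work.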
* [PATCH v8 10/14] arch/arm64: adopt prepare_exit_to_usermode() model from x86 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-arm-kernel, linux-kernel Cc: Chris Metcalf This change is a prerequisite change for TASK_ISOLATION but also stands on its own for readability and maintainability. The existing arm64 do_notify_resume() is called in a loop from assembly on the slow path; this change moves the loop into C code as well. For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C"). Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/kernel/entry.S | 6 +++--- arch/arm64/kernel/signal.c | 32 ++++++++++++++++++++++---------- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index 4306c937b1ff..6fcbf8ea307b 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -628,9 +628,8 @@ work_pending: mov x0, sp // 'regs' tst x2, #PSR_MODE_MASK // user mode regs? 
b.ne no_work_pending // returning to kernel - enable_irq // enable interrupts for do_notify_resume() - bl do_notify_resume - b ret_to_user + bl prepare_exit_to_usermode + b no_user_work_pending work_resched: bl schedule @@ -642,6 +641,7 @@ ret_to_user: ldr x1, [tsk, #TI_FLAGS] and x2, x1, #_TIF_WORK_MASK cbnz x2, work_pending +no_user_work_pending: enable_step_tsk x1, x2 no_work_pending: kernel_exit 0 diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index e18c48cb6db1..fde59c1139a9 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -399,18 +399,30 @@ static void do_signal(struct pt_regs *regs) restore_saved_sigmask(); } -asmlinkage void do_notify_resume(struct pt_regs *regs, - unsigned int thread_flags) +asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs, + unsigned int thread_flags) { - if (thread_flags & _TIF_SIGPENDING) - do_signal(regs); + do { + local_irq_enable(); - if (thread_flags & _TIF_NOTIFY_RESUME) { - clear_thread_flag(TIF_NOTIFY_RESUME); - tracehook_notify_resume(regs); - } + if (thread_flags & _TIF_NEED_RESCHED) + schedule(); + + if (thread_flags & _TIF_SIGPENDING) + do_signal(regs); + + if (thread_flags & _TIF_NOTIFY_RESUME) { + clear_thread_flag(TIF_NOTIFY_RESUME); + tracehook_notify_resume(regs); + } + + if (thread_flags & _TIF_FOREIGN_FPSTATE) + fpsimd_restore_current_state(); + + local_irq_disable(); - if (thread_flags & _TIF_FOREIGN_FPSTATE) - fpsimd_restore_current_state(); + thread_flags = READ_ONCE(current_thread_info()->flags) & + _TIF_WORK_MASK; + } while (thread_flags); } -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 11/14] arch/arm64: enable task isolation functionality 2015-10-20 20:35 ` Chris Metcalf @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-arm-kernel, linux-kernel Cc: Chris Metcalf We need to call task_isolation_enter() from prepare_exit_to_usermode(), so that we can both ensure we do it last before returning to userspace, and we also are able to re-run signal handling, etc., if something occurs while task_isolation_enter() has interrupts enabled. To do this we add _TIF_NOHZ to the _TIF_WORK_MASK if we have CONFIG_TASK_ISOLATION enabled, which brings us into prepare_exit_to_usermode() on all return to userspace. But we don't put _TIF_NOHZ in the flags that we use to loop back and recheck, since we don't need to loop back only because the flag is set. Instead we unconditionally call task_isolation_enter() at the end of the loop if any other work is done. To make the assembly code continue to be as optimized as before, we renumber the _TIF flags so that both _TIF_WORK_MASK and _TIF_SYSCALL_WORK still have contiguous runs of bits in the immediate operand for the "and" instruction, as required by the ARM64 ISA. Since TIF_NOHZ is in both masks, it must be the middle bit in the contiguous run that starts with the _TIF_WORK_MASK bits and ends with the _TIF_SYSCALL_WORK bits. We tweak syscall_trace_enter() slightly to carry the "flags" value from current_thread_info()->flags for each of the tests, rather than doing a volatile read from memory for each one. This avoids a small overhead for each test, and in particular avoids that overhead for TIF_NOHZ when TASK_ISOLATION is not enabled. 
Finally, add an explicit check for STRICT mode in do_mem_abort() to handle the case of page faults. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/arm64/include/asm/thread_info.h | 18 ++++++++++++------ arch/arm64/kernel/ptrace.c | 12 +++++++++--- arch/arm64/kernel/signal.c | 7 +++++-- arch/arm64/mm/fault.c | 4 ++++ 4 files changed, 30 insertions(+), 11 deletions(-) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index dcd06d18a42a..4c36c4ee3528 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -101,11 +101,11 @@ static inline struct thread_info *current_thread_info(void) #define TIF_NEED_RESCHED 1 #define TIF_NOTIFY_RESUME 2 /* callback before returning to user */ #define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */ -#define TIF_NOHZ 7 -#define TIF_SYSCALL_TRACE 8 -#define TIF_SYSCALL_AUDIT 9 -#define TIF_SYSCALL_TRACEPOINT 10 -#define TIF_SECCOMP 11 +#define TIF_NOHZ 4 +#define TIF_SYSCALL_TRACE 5 +#define TIF_SYSCALL_AUDIT 6 +#define TIF_SYSCALL_TRACEPOINT 7 +#define TIF_SECCOMP 8 #define TIF_MEMDIE 18 /* is terminating due to OOM killer */ #define TIF_FREEZE 19 #define TIF_RESTORE_SIGMASK 20 @@ -124,9 +124,15 @@ static inline struct thread_info *current_thread_info(void) #define _TIF_SECCOMP (1 << TIF_SECCOMP) #define _TIF_32BIT (1 << TIF_32BIT) -#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ +#define _TIF_WORK_LOOP_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE) +#ifdef CONFIG_TASK_ISOLATION +# define _TIF_WORK_MASK (_TIF_WORK_LOOP_MASK | _TIF_NOHZ) +#else +# define _TIF_WORK_MASK _TIF_WORK_LOOP_MASK +#endif + #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \ _TIF_NOHZ) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 1971f491bb90..69ed3ba81650 100644 --- a/arch/arm64/kernel/ptrace.c +++ 
b/arch/arm64/kernel/ptrace.c @@ -37,6 +37,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/debug-monitors.h> @@ -1240,14 +1241,19 @@ static void tracehook_report_syscall(struct pt_regs *regs, asmlinkage int syscall_trace_enter(struct pt_regs *regs) { - /* Do the secure computing check first; failures should be fast. */ + unsigned long work = ACCESS_ONCE(current_thread_info()->flags); + + if ((work & _TIF_NOHZ) && task_isolation_check_syscall(regs->syscallno)) + return -1; + + /* Do the secure computing check early; failures should be fast. */ if (secure_computing() == -1) return -1; - if (test_thread_flag(TIF_SYSCALL_TRACE)) + if (work & _TIF_SYSCALL_TRACE) tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER); - if (test_thread_flag(TIF_SYSCALL_TRACEPOINT)) + if (work & _TIF_SYSCALL_TRACEPOINT) trace_sys_enter(regs, regs->syscallno); audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1], diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index fde59c1139a9..641c828653c7 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -25,6 +25,7 @@ #include <linux/uaccess.h> #include <linux/tracehook.h> #include <linux/ratelimit.h> +#include <linux/isolation.h> #include <asm/debug-monitors.h> #include <asm/elf.h> @@ -419,10 +420,12 @@ asmlinkage void prepare_exit_to_usermode(struct pt_regs *regs, if (thread_flags & _TIF_FOREIGN_FPSTATE) fpsimd_restore_current_state(); + task_isolation_enter(); + local_irq_disable(); thread_flags = READ_ONCE(current_thread_info()->flags) & - _TIF_WORK_MASK; + _TIF_WORK_LOOP_MASK; - } while (thread_flags); + } while (thread_flags || !task_isolation_ready()); } diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 9fadf6d7039b..a726f9f3ef3c 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -29,6 +29,7 @@ #include <linux/sched.h> #include <linux/highmem.h> #include 
<linux/perf_event.h> +#include <linux/isolation.h> #include <asm/cpufeature.h> #include <asm/exception.h> @@ -466,6 +467,9 @@ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr, const struct fault_info *inf = fault_info + (esr & 63); struct siginfo info; + if (user_mode(regs)) + task_isolation_check_exception("%s at %#lx", inf->name, addr); + if (!inf->fn(addr, esr, regs)) return; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
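The contiguity constraint driving the TIF renumbering can be checked mechanically: ARM64 logical immediates for `and` must encode the mask as a rotated contiguous run of set bits. A hedged userspace sketch, using illustrative `M_*` names that mirror the renumbered bits rather than the kernel's macros:

```c
#include <stdbool.h>

#define B(n) (1u << (n))

/* Bit positions after the patch's renumbering. */
#define M_SIGPENDING         B(0)
#define M_NEED_RESCHED       B(1)
#define M_NOTIFY_RESUME      B(2)
#define M_FOREIGN_FPSTATE    B(3)
#define M_NOHZ               B(4)	/* shared middle bit of both masks */
#define M_SYSCALL_TRACE      B(5)
#define M_SYSCALL_AUDIT      B(6)
#define M_SYSCALL_TRACEPOINT B(7)
#define M_SECCOMP            B(8)

#define M_WORK_MASK    (M_SIGPENDING | M_NEED_RESCHED | \
			M_NOTIFY_RESUME | M_FOREIGN_FPSTATE | M_NOHZ)
#define M_SYSCALL_WORK (M_SYSCALL_TRACE | M_SYSCALL_AUDIT | \
			M_SYSCALL_TRACEPOINT | M_SECCOMP | M_NOHZ)

/* A nonzero mask is one contiguous run of set bits iff, after shifting
 * out its trailing zeros, the remaining value is of the form 2^k - 1. */
static bool contiguous_run(unsigned int mask)
{
	if (mask == 0)
		return false;
	while (!(mask & 1))
		mask >>= 1;
	return (mask & (mask + 1)) == 0;
}
```

Both masks pass the check, and their intersection is exactly the shared `NOHZ` bit, matching the commit message's description of its placement.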
* [PATCH v8 12/14] arch/tile: adopt prepare_exit_to_usermode() model from x86 2015-10-20 20:35 ` Chris Metcalf ` (11 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf This change is a prerequisite change for TASK_ISOLATION but also stands on its own for readability and maintainability. The existing tile do_work_pending() was called in a loop from assembly on the slow path; this change moves the loop into C code as well. For the x86 version see commit c5c46f59e4e7 ("x86/entry: Add new, comprehensible entry and exit handlers written in C"). This change exposes a pre-existing bug on the older tilepro platform; the singlestep processing is done last, but on tilepro (unlike tilegx) we enable interrupts while doing that processing, so we could in theory miss a signal or other asynchronous event. A future change could fix this by breaking the singlestep work into a "prepare" step done in the main loop, and a "trigger" step done after exiting the loop. Since this change is intended as purely a restructuring change, we call out the bug explicitly now, but don't yet fix it. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/thread_info.h | 8 +++- arch/tile/kernel/intvec_32.S | 46 +++++++-------------- arch/tile/kernel/intvec_64.S | 49 +++++++---------------- arch/tile/kernel/process.c | 79 +++++++++++++++++++------------------ 5 files changed, 77 insertions(+), 107 deletions(-) diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h index 139dfdee0134..0684e88aacd8 100644 --- a/arch/tile/include/asm/processor.h +++ b/arch/tile/include/asm/processor.h @@ -212,7 +212,7 @@ static inline void release_thread(struct task_struct *dead_task) /* Nothing for now */ } -extern int do_work_pending(struct pt_regs *regs, u32 flags); +extern void prepare_exit_to_usermode(struct pt_regs *regs, u32 flags); /* diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h index dc1fb28d9636..4b7cef9e94e0 100644 --- a/arch/tile/include/asm/thread_info.h +++ b/arch/tile/include/asm/thread_info.h @@ -140,10 +140,14 @@ extern void _cpu_idle(void); #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) #define _TIF_NOHZ (1<<TIF_NOHZ) +/* Work to do as we loop to exit to user space. */ +#define _TIF_WORK_MASK \ + (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \ + _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME) + /* Work to do on any return to user space. */ #define _TIF_ALLWORK_MASK \ - (_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_SINGLESTEP | \ - _TIF_ASYNC_TLB | _TIF_NOTIFY_RESUME | _TIF_NOHZ) + (_TIF_WORK_MASK | _TIF_SINGLESTEP | _TIF_NOHZ) /* Work to do at syscall entry. */ #define _TIF_SYSCALL_ENTRY_WORK \ diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S index fbbe2ea882ea..33d48812872a 100644 --- a/arch/tile/kernel/intvec_32.S +++ b/arch/tile/kernel/intvec_32.S @@ -846,18 +846,6 @@ STD_ENTRY(interrupt_return) FEEDBACK_REENTER(interrupt_return) /* - * Use r33 to hold whether we have already loaded the callee-saves - * into ptregs. 
We don't want to do it twice in this loop, since - * then we'd clobber whatever changes are made by ptrace, etc. - * Get base of stack in r32. - */ - { - GET_THREAD_INFO(r32) - movei r33, 0 - } - -.Lretry_work_pending: - /* * Disable interrupts so as to make sure we don't * miss an interrupt that sets any of the thread flags (like * need_resched or sigpending) between sampling and the iret. @@ -867,33 +855,27 @@ STD_ENTRY(interrupt_return) IRQ_DISABLE(r20, r21) TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */ - - /* Check to see if there is any work to do before returning to user. */ + /* + * See if there are any work items (including single-shot items) + * to do. If so, save the callee-save registers to pt_regs + * and then dispatch to C code. + */ + GET_THREAD_INFO(r21) { - addi r29, r32, THREAD_INFO_FLAGS_OFFSET - moveli r1, lo16(_TIF_ALLWORK_MASK) + addi r22, r21, THREAD_INFO_FLAGS_OFFSET + moveli r20, lo16(_TIF_ALLWORK_MASK) } { - lw r29, r29 - auli r1, r1, ha16(_TIF_ALLWORK_MASK) + lw r22, r22 + auli r20, r20, ha16(_TIF_ALLWORK_MASK) } - and r1, r29, r1 - bzt r1, .Lrestore_all - - /* - * Make sure we have all the registers saved for signal - * handling, notify-resume, or single-step. Call out to C - * code to figure out exactly what we need to do for each flag bit, - * then if necessary, reload the flags and recheck. 
- */ + and r1, r22, r20 { PTREGS_PTR(r0, PTREGS_OFFSET_BASE) - bnz r33, 1f + bzt r1, .Lrestore_all } push_extra_callee_saves r0 - movei r33, 1 -1: jal do_work_pending - bnz r0, .Lretry_work_pending + jal prepare_exit_to_usermode /* * In the NMI case we @@ -1327,7 +1309,7 @@ STD_ENTRY(ret_from_kernel_thread) FEEDBACK_REENTER(ret_from_kernel_thread) { movei r30, 0 /* not an NMI */ - j .Lresume_userspace /* jump into middle of interrupt_return */ + j interrupt_return } STD_ENDPROC(ret_from_kernel_thread) diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S index 58964d209d4d..a41c994ce237 100644 --- a/arch/tile/kernel/intvec_64.S +++ b/arch/tile/kernel/intvec_64.S @@ -879,20 +879,6 @@ STD_ENTRY(interrupt_return) FEEDBACK_REENTER(interrupt_return) /* - * Use r33 to hold whether we have already loaded the callee-saves - * into ptregs. We don't want to do it twice in this loop, since - * then we'd clobber whatever changes are made by ptrace, etc. - */ - { - movei r33, 0 - move r32, sp - } - - /* Get base of stack in r32. */ - EXTRACT_THREAD_INFO(r32) - -.Lretry_work_pending: - /* * Disable interrupts so as to make sure we don't * miss an interrupt that sets any of the thread flags (like * need_resched or sigpending) between sampling and the iret. @@ -902,33 +888,28 @@ STD_ENTRY(interrupt_return) IRQ_DISABLE(r20, r21) TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */ - - /* Check to see if there is any work to do before returning to user. */ + /* + * See if there are any work items (including single-shot items) + * to do. If so, save the callee-save registers to pt_regs + * and then dispatch to C code. 
+ */ + move r21, sp + EXTRACT_THREAD_INFO(r21) { - addi r29, r32, THREAD_INFO_FLAGS_OFFSET - moveli r1, hw1_last(_TIF_ALLWORK_MASK) + addi r22, r21, THREAD_INFO_FLAGS_OFFSET + moveli r20, hw1_last(_TIF_ALLWORK_MASK) } { - ld r29, r29 - shl16insli r1, r1, hw0(_TIF_ALLWORK_MASK) + ld r22, r22 + shl16insli r20, r20, hw0(_TIF_ALLWORK_MASK) } - and r1, r29, r1 - beqzt r1, .Lrestore_all - - /* - * Make sure we have all the registers saved for signal - * handling or notify-resume. Call out to C code to figure out - * exactly what we need to do for each flag bit, then if - * necessary, reload the flags and recheck. - */ + and r1, r22, r20 { PTREGS_PTR(r0, PTREGS_OFFSET_BASE) - bnez r33, 1f + beqzt r1, .Lrestore_all } push_extra_callee_saves r0 - movei r33, 1 -1: jal do_work_pending - bnez r0, .Lretry_work_pending + jal prepare_exit_to_usermode /* * In the NMI case we @@ -1411,7 +1392,7 @@ STD_ENTRY(ret_from_kernel_thread) FEEDBACK_REENTER(ret_from_kernel_thread) { movei r30, 0 /* not an NMI */ - j .Lresume_userspace /* jump into middle of interrupt_return */ + j interrupt_return } STD_ENDPROC(ret_from_kernel_thread) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index 7d5769310bef..b5f30d376ce1 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -462,54 +462,57 @@ struct task_struct *__sched _switch_to(struct task_struct *prev, /* * This routine is called on return from interrupt if any of the - * TIF_WORK_MASK flags are set in thread_info->flags. It is - * entered with interrupts disabled so we don't miss an event - * that modified the thread_info flags. If any flag is set, we - * handle it and return, and the calling assembly code will - * re-disable interrupts, reload the thread flags, and call back - * if more flags need to be handled. - * - * We return whether we need to check the thread_info flags again - * or not. 
Note that we don't clear TIF_SINGLESTEP here, so it's - * important that it be tested last, and then claim that we don't - * need to recheck the flags. + * TIF_ALLWORK_MASK flags are set in thread_info->flags. It is + * entered with interrupts disabled so we don't miss an event that + * modified the thread_info flags. We loop until all the tested flags + * are clear. Note that the function is called on certain conditions + * that are not listed in the loop condition here (e.g. SINGLESTEP) + * which guarantees we will do those things once, and redo them if any + * of the other work items is re-done, but won't continue looping if + * all the other work is done. */ -int do_work_pending(struct pt_regs *regs, u32 thread_info_flags) +void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags) { - /* If we enter in kernel mode, do nothing and exit the caller loop. */ - if (!user_mode(regs)) - return 0; + if (WARN_ON(!user_mode(regs))) + return; - user_exit(); + do { + local_irq_enable(); - /* Enable interrupts; they are disabled again on return to caller. 
*/ - local_irq_enable(); + if (thread_info_flags & _TIF_NEED_RESCHED) + schedule(); - if (thread_info_flags & _TIF_NEED_RESCHED) { - schedule(); - return 1; - } #if CHIP_HAS_TILE_DMA() - if (thread_info_flags & _TIF_ASYNC_TLB) { - do_async_page_fault(regs); - return 1; - } + if (thread_info_flags & _TIF_ASYNC_TLB) + do_async_page_fault(regs); #endif - if (thread_info_flags & _TIF_SIGPENDING) { - do_signal(regs); - return 1; - } - if (thread_info_flags & _TIF_NOTIFY_RESUME) { - clear_thread_flag(TIF_NOTIFY_RESUME); - tracehook_notify_resume(regs); - return 1; - } - if (thread_info_flags & _TIF_SINGLESTEP) + + if (thread_info_flags & _TIF_SIGPENDING) + do_signal(regs); + + if (thread_info_flags & _TIF_NOTIFY_RESUME) { + clear_thread_flag(TIF_NOTIFY_RESUME); + tracehook_notify_resume(regs); + } + + local_irq_disable(); + thread_info_flags = READ_ONCE(current_thread_info()->flags); + + } while (thread_info_flags & _TIF_WORK_MASK); + + if (thread_info_flags & _TIF_SINGLESTEP) { single_step_once(regs); +#ifndef __tilegx__ + /* + * FIXME: on tilepro, since we enable interrupts in + * this routine, it's possible that we miss a signal + * or other asynchronous event. + */ + local_irq_disable(); +#endif + } user_enter(); - - return 0; } unsigned long get_wchan(struct task_struct *p) -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 13/14] arch/tile: turn off timer tick for oneshot_stopped state 2015-10-20 20:35 ` Chris Metcalf ` (12 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf When the schedule tick is disabled in tick_nohz_stop_sched_tick(), we call hrtimer_cancel(), which eventually calls down into __remove_hrtimer() and thus into hrtimer_force_reprogram(). That function's call to tick_program_event() detects that we are trying to set the expiration to KTIME_MAX and calls clockevents_switch_state() to set the state to ONESHOT_STOPPED, and returns. However, by default the internal __clockevents_switch_state() code doesn't have a "set_state_oneshot_stopped" function pointer for the tile clock_event_device, so that code returns -ENOSYS, and we end up not setting the state, and more importantly, we don't actually turn off the tile hardware timer. As a result, the timer tick we were waiting for before is still queued, and fires shortly afterwards, only to discover there was nothing for it to do, at which point it quiesces. The fix is to provide that function pointer for tile, and like the other function pointers, have it just turn off the timer interrupt. Any call to set a new timer interval will properly re-enable it. This fix avoids a small performance hiccup for regular applications, but for TASK_ISOLATION code, it fixes a potentially disastrous kernel timer interruption that could cause packets to be dropped. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/time.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c index 178989e6d3e3..fbedf380d9d4 100644 --- a/arch/tile/kernel/time.c +++ b/arch/tile/kernel/time.c @@ -159,6 +159,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = { .set_next_event = tile_timer_set_next_event, .set_state_shutdown = tile_timer_shutdown, .set_state_oneshot = tile_timer_shutdown, + .set_state_oneshot_stopped = tile_timer_shutdown, .tick_resume = tile_timer_shutdown, }; -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* [PATCH v8 14/14] arch/tile: enable task isolation functionality 2015-10-20 20:35 ` Chris Metcalf ` (13 preceding siblings ...) (?) @ 2015-10-20 20:36 ` Chris Metcalf -1 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-20 20:36 UTC (permalink / raw) To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-kernel Cc: Chris Metcalf We add the necessary call to task_isolation_enter() in the prepare_exit_to_usermode() routine. We already unconditionally call into this routine if TIF_NOHZ is set, since that's where we do the user_enter() call. We add calls to task_isolation_check_exception() in places where exceptions may not generate signals to the application. In addition, we add an overriding task_isolation_wait() call that runs a nap instruction while waiting for an interrupt, to make the task_isolation_enter() loop run in a lower-power state. 
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> --- arch/tile/kernel/process.c | 6 +++++- arch/tile/kernel/ptrace.c | 6 +++++- arch/tile/kernel/single_step.c | 5 +++++ arch/tile/kernel/unaligned.c | 3 +++ arch/tile/mm/fault.c | 3 +++ arch/tile/mm/homecache.c | 5 ++++- 6 files changed, 25 insertions(+), 3 deletions(-) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index b5f30d376ce1..832febfd65df 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -29,6 +29,7 @@ #include <linux/signal.h> #include <linux/delay.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/stack.h> #include <asm/switch_to.h> #include <asm/homecache.h> @@ -495,10 +496,13 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags) tracehook_notify_resume(regs); } + task_isolation_enter(); + local_irq_disable(); thread_info_flags = READ_ONCE(current_thread_info()->flags); - } while (thread_info_flags & _TIF_WORK_MASK); + } while ((thread_info_flags & _TIF_WORK_MASK) || + !task_isolation_ready()); if (thread_info_flags & _TIF_SINGLESTEP) { single_step_once(regs); diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c index bdc126faf741..63acf7b4655f 100644 --- a/arch/tile/kernel/ptrace.c +++ b/arch/tile/kernel/ptrace.c @@ -23,6 +23,7 @@ #include <linux/elf.h> #include <linux/tracehook.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/traps.h> #include <arch/chip.h> @@ -259,8 +260,11 @@ int do_syscall_trace_enter(struct pt_regs *regs) * If TIF_NOHZ is set, we are required to call user_exit() before * doing anything that could touch RCU. 
*/ - if (work & _TIF_NOHZ) + if (work & _TIF_NOHZ) { user_exit(); + if (task_isolation_check_syscall(regs->regs[TREG_SYSCALL_NR])) + return -1; + } if (secure_computing() == -1) return -1; diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c index 53f7b9def07b..4cba9f4a1915 100644 --- a/arch/tile/kernel/single_step.c +++ b/arch/tile/kernel/single_step.c @@ -23,6 +23,7 @@ #include <linux/types.h> #include <linux/err.h> #include <linux/prctl.h> +#include <linux/isolation.h> #include <linux/context_tracking.h> #include <asm/cacheflush.h> #include <asm/traps.h> @@ -321,6 +322,8 @@ void single_step_once(struct pt_regs *regs) int size = 0, sign_ext = 0; /* happy compiler */ int align_ctl; + task_isolation_check_exception("single step at %#lx", regs->pc); + align_ctl = unaligned_fixup; switch (task_thread_info(current)->align_ctl) { case PR_UNALIGN_NOPRINT: @@ -770,6 +773,8 @@ void single_step_once(struct pt_regs *regs) unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc); unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K); + task_isolation_check_exception("single step at %#lx", regs->pc); + *ss_pc = regs->pc; control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK; control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK; diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c index d075f92ccee0..dbb9c1144236 100644 --- a/arch/tile/kernel/unaligned.c +++ b/arch/tile/kernel/unaligned.c @@ -26,6 +26,7 @@ #include <linux/compat.h> #include <linux/prctl.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/cacheflush.h> #include <asm/traps.h> #include <asm/uaccess.h> @@ -1547,6 +1548,8 @@ void do_unaligned(struct pt_regs *regs, int vecnum) goto done; } + task_isolation_check_exception("unaligned JIT at %#lx", regs->pc); + if (!info->unalign_jit_base) { void __user *user_page; diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c index 13eac59bf16a..53514ca54143 100644 --- a/arch/tile/mm/fault.c +++ 
b/arch/tile/mm/fault.c @@ -36,6 +36,7 @@ #include <linux/uaccess.h> #include <linux/kdebug.h> #include <linux/context_tracking.h> +#include <linux/isolation.h> #include <asm/pgalloc.h> #include <asm/sections.h> @@ -846,6 +847,8 @@ void do_page_fault(struct pt_regs *regs, int fault_num, unsigned long address, unsigned long write) { enum ctx_state prev_state = exception_enter(); + task_isolation_check_exception("page fault interrupt %d at %#lx (%#lx)", + fault_num, regs->pc, address); __do_page_fault(regs, fault_num, address, write); exception_exit(prev_state); } diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c index 40ca30a9fee3..a79325113105 100644 --- a/arch/tile/mm/homecache.c +++ b/arch/tile/mm/homecache.c @@ -31,6 +31,7 @@ #include <linux/smp.h> #include <linux/module.h> #include <linux/hugetlb.h> +#include <linux/isolation.h> #include <asm/page.h> #include <asm/sections.h> @@ -83,8 +84,10 @@ static void hv_flush_update(const struct cpumask *cache_cpumask, * Don't bother to update atomically; losing a count * here is not that critical. */ - for_each_cpu(cpu, &mask) + for_each_cpu(cpu, &mask) { ++per_cpu(irq_stat, cpu).irq_hv_flush_count; + task_isolation_debug(cpu); + } } /* -- 2.1.2 ^ permalink raw reply related [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-20 20:35 ` Chris Metcalf ` (14 preceding siblings ...) (?) @ 2015-10-21 12:39 ` Peter Zijlstra 2015-10-22 20:31 ` Chris Metcalf -1 siblings, 1 reply; 340+ messages in thread From: Peter Zijlstra @ 2015-10-21 12:39 UTC (permalink / raw) To: Chris Metcalf Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel Can you *please* start a new thread with each posting? This is absolutely unmanageable. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-21 12:39 ` [PATCH v8 00/14] support "task_isolation" mode for nohz_full Peter Zijlstra @ 2015-10-22 20:31 ` Chris Metcalf 0 siblings, 0 replies; 340+ messages in thread From: Chris Metcalf @ 2015-10-22 20:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On 10/21/2015 08:39 AM, Peter Zijlstra wrote: > Can you *please* start a new thread with each posting? > > This is absolutely unmanageable. I've been explicitly threading the multiple patch series on purpose due to this text in "git help send-email": --in-reply-to=<identifier> Make the first mail (or all the mails with --no-thread) appear as a reply to the given Message-Id, which avoids breaking threads to provide a new patch series. The second and subsequent emails will be sent as replies according to the --[no]-chain-reply-to setting. So for example when --thread and --no-chain-reply-to are specified, the second and subsequent patches will be replies to the first one like in the illustration below where [PATCH v2 0/3] is in reply to [PATCH 0/2]: [PATCH 0/2] Here is what I did... [PATCH 1/2] Clean up and tests [PATCH 2/2] Implementation [PATCH v2 0/3] Here is a reroll [PATCH v2 1/3] Clean up [PATCH v2 2/3] New tests [PATCH v2 3/3] Implementation It sounds like this is exactly the behavior you are objecting to. It's all one to me because I am not seeing these emails come up in some hugely nested fashion, but just viewing the responses that I haven't yet triaged away. So is your recommendation to avoid the git send-email --in-reply-to option? 
If so, would you recommend including an lkml.kernel.org link in the cover letter pointing to the previous version, or is there something else that would make your workflow better? If you think this is actually the wrong thing, is it worth trying to fix the git docs to deprecate this option? Or is it more a question of scale, and the 80-odd patches that I've posted so far just pushed an otherwise good system into a more dysfunctional mode? If so, perhaps some text in Documentation/SubmittingPatches would be helpful here. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-22 20:31 ` Chris Metcalf (?) @ 2015-10-23 2:33 ` Frederic Weisbecker 2015-10-23 8:49 ` Peter Zijlstra -1 siblings, 1 reply; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-23 2:33 UTC (permalink / raw) To: Chris Metcalf Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote: > On 10/21/2015 08:39 AM, Peter Zijlstra wrote: > >Can you *please* start a new thread with each posting? > > > >This is absolutely unmanageable. > > I've been explicitly threading the multiple patch series on purpose > due to this text in "git help send-email": > > --in-reply-to=<identifier> > Make the first mail (or all the mails with --no-thread) appear > as a reply to the given Message-Id, which avoids breaking > threads to provide a new patch series. The second and subsequent > emails will be sent as replies according to the > --[no]-chain-reply-to setting. > > So for example when --thread and --no-chain-reply-to are > specified, the second and subsequent patches will be replies to > the first one like in the illustration below where [PATCH v2 > 0/3] is in reply to [PATCH 0/2]: > > [PATCH 0/2] Here is what I did... > [PATCH 1/2] Clean up and tests > [PATCH 2/2] Implementation > [PATCH v2 0/3] Here is a reroll > [PATCH v2 1/3] Clean up > [PATCH v2 2/3] New tests > [PATCH v2 3/3] Implementation > > It sounds like this is exactly the behavior you are objecting > to. It's all one to me because I am not seeing these emails > come up in some hugely nested fashion, but just viewing the > responses that I haven't yet triaged away. 
I personally (and I think this is the general LKML behaviour) use in-reply-to when I post a single patch that is a fix for a bug, or a small enhancement, discussed on some thread. It works well as it fits the conversation inline. But for anything that requires significant changes, namely a patchset, and that includes a new version of such patchset, it's usually better to create a new thread. Otherwise the thread becomes an infinite mess and it eventually expands further the mail client columns. Thanks. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-23 2:33 ` Frederic Weisbecker @ 2015-10-23 8:49 ` Peter Zijlstra 2015-10-23 13:29 ` Frederic Weisbecker 0 siblings, 1 reply; 340+ messages in thread From: Peter Zijlstra @ 2015-10-23 8:49 UTC (permalink / raw) To: Frederic Weisbecker Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote: > On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote: > > On 10/21/2015 08:39 AM, Peter Zijlstra wrote: > > >Can you *please* start a new thread with each posting? > > > > > >This is absolutely unmanageable. > > > > I've been explicitly threading the multiple patch series on purpose > > due to this text in "git help send-email": > > > > --in-reply-to=<identifier> > > Make the first mail (or all the mails with --no-thread) appear > > as a reply to the given Message-Id, which avoids breaking > > threads to provide a new patch series. The second and subsequent > > emails will be sent as replies according to the > > --[no]-chain-reply-to setting. > > > > So for example when --thread and --no-chain-reply-to are > > specified, the second and subsequent patches will be replies to > > the first one like in the illustration below where [PATCH v2 > > 0/3] is in reply to [PATCH 0/2]: > > > > [PATCH 0/2] Here is what I did... > > [PATCH 1/2] Clean up and tests > > [PATCH 2/2] Implementation > > [PATCH v2 0/3] Here is a reroll > > [PATCH v2 1/3] Clean up > > [PATCH v2 2/3] New tests > > [PATCH v2 3/3] Implementation > > > > It sounds like this is exactly the behavior you are objecting > > to. 
It's all one to me because I am not seeing these emails > > come up in some hugely nested fashion, but just viewing the > > responses that I haven't yet triaged away. Yeah, the git people are not per definition following lkml standards, even though git originated 'here'. They, for a long time, also defaulted to --chain-reply-to, which is absolutely insane. > I personally (and I think this is the general LKML behaviour) use in-reply-to > when I post a single patch that is a fix for a bug, or a small enhancement, > discussed on some thread. It works well as it fits the conversation inline. > > But for anything that requires significant changes, namely a patchset, > and that includes a new version of such patchset, it's usually better > to create a new thread. Otherwise the thread becomes an infinite mess and it > eventually expands further the mail client columns. Agreed, although for single patches I use my regular mailer (mutt) and can't be arsed with tools. Also I don't actually use git-send-email ever, so I might be biased. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full 2015-10-23 8:49 ` Peter Zijlstra @ 2015-10-23 13:29 ` Frederic Weisbecker 0 siblings, 0 replies; 340+ messages in thread From: Frederic Weisbecker @ 2015-10-23 13:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo, Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel On Fri, Oct 23, 2015 at 10:49:51AM +0200, Peter Zijlstra wrote: > On Fri, Oct 23, 2015 at 04:33:02AM +0200, Frederic Weisbecker wrote: > > I personally (and I think this is the general LKML behaviour) use in-reply-to > > when I post a single patch that is a fix for a bug, or a small enhancement, > > discussed on some thread. It works well as it fits the conversation inline. > > > > But for anything that requires significant changes, namely a patchset, > > and that includes a new version of such patchset, it's usually better > > to create a new thread. Otherwise the thread becomes an infinite mess and it > > eventually expands further the mail client columns. > > Agreed, although for single patches I use my regular mailer (mutt) and > can't be arsed with tools. Yeah me too, otherwise I can't write a text before the patch changelog. > Also I don't actually use git-send-email ever, so I might be biased. Ah it's just too convenient so I wrote my scripts on top of it :-) But surely many mail sender libraries can post patches just fine as well. ^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-23 9:04 ` Peter Zijlstra
  From: Peter Zijlstra
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
      Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
      Paul E. McKenney, Christoph Lameter, Viresh Kumar, Catalin Marinas,
      Will Deacon, Andy Lutomirski, linux-doc, linux-api, linux-kernel

On Thu, Oct 22, 2015 at 04:31:44PM -0400, Chris Metcalf wrote:
> So is your recommendation to avoid the git send-email --in-reply-to
> option? If so, would you recommend including an lkml.kernel.org
> link in the cover letter pointing to the previous version, or
> is there something else that would make your workflow better?

Mostly people don't bother with pointing to previous versions, and if
they have the same 0/x subject, they're typically trivial to find anyway.

But if you really feel the need for explicit references to previous
versions, then yes, lkml.kernel.org/r/ links are preferred over pretty
much anything else I think.

> If you think this is actually the wrong thing, is it worth trying
> to fix the git docs to deprecate this option?

As said in the other email; git has different standards than lkml. By
now we're just one of many many users of git.

> Or is it more a question
> of scale, and the 80-odd patches that I've posted so far just pushed
> an otherwise good system into a more dysfunctional mode? If so,
> perhaps some text in Documentation/SubmittingPatches would be
> helpful here.

Documentation/email-clients.txt maybe.
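The workflow recommended here (start a fresh thread per version, optionally dropping a lkml.kernel.org/r/ link to the previous version into the cover letter) can be sketched as follows. A hedged example, assuming only that git is installed; the repository, addresses, and message-id placeholder are all hypothetical:

```shell
# Sketch: prepare a v2 patchset as its own thread rather than chaining
# it onto the v1 thread with --in-reply-to.
set -e
dir=$(mktemp -d)
git init -q "$dir/demo" && cd "$dir/demo"
for i in 1 2 3; do
  git -c user.name=Dev -c user.email=dev@example.com \
      commit -q --allow-empty -m "example: change $i"
done
# --cover-letter gives the v2 series its own 0/N root message; a line
# such as "v1: https://lkml.kernel.org/r/<v1-message-id>" can be added
# to that cover letter by hand before sending.
git format-patch --cover-letter -v2 -2 -o out >/dev/null
ls out
# Send WITHOUT --in-reply-to so the series starts a new thread:
#   git send-email --to=maintainer@example.com out/*.patch
```

With the same `0/x` subject prefix across versions, reviewers can locate the earlier rounds even without an explicit link, which is why the link is optional.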
* Re: [PATCH v8 00/14] support "task_isolation" mode for nohz_full
  2015-10-23 11:52 ` Theodore Ts'o
  From: Theodore Ts'o
  To: Peter Zijlstra
  Cc: Chris Metcalf, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
      Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
      Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
      Catalin Marinas, Will Deacon, Andy Lutomirski, linux-doc, linux-api,
      linux-kernel

On Fri, Oct 23, 2015 at 11:04:59AM +0200, Peter Zijlstra wrote:
> > If you think this is actually the wrong thing, is it worth trying
> > to fix the git docs to deprecate this option?
>
> As said in the other email; git has different standards than lkml. By
> now we're just one of many many users of git.

Even git developers will create a new thread for a large (more than 2-3
patches) patch set. However, for a single patch, people have chained the
-v3 version of the draft --- not to the v2 version, though, but to the
review of the patch. And I've seen that behavior on some LKML lists, and
I'm certainly fine with it on linux-ext4. But if you have a huge patch
series, and you keep chaining it unto the 8th, 10th, 22nd version, it
certainly will get **very** annoying for some MUA's.

The bottom line is that you should use common sense, and it can be hard
to document every last bit of what should be "common sense" into a rule
that is followed by robots or a perl script. (Which is one of the
reasons why I'm not fond of the philosophy that every single last
checkpatch warning or error should result in a "cleanup" patch, but
that's another issue.)

Cheers,

                                        - Ted