* [PATCH v5 0/9] "Task_isolation" mode
From: Alex Belits @ 2020-11-23 17:42 UTC
To: nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

This is an update of the task isolation work that was originally done by Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until November 2017. It is adapted to the current kernel and cleaned up to implement its functionality in a more complete and cleaner manner.

The previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com/

The last version by Chris Metcalf (now obsolete, but possibly relevant for comparison and for understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetcalf@mellanox.com

Supported architectures

This version includes only architecture-independent code and arm64 support. x86 and arm support, and everything related to virtualization, will be re-added later, once the new kernel entry/exit implementation is accommodated. Support for other architectures can be added in a somewhat modular manner; however, it heavily depends on the details of kernel entry/exit support on each particular architecture. Development of common entry/exit code and conversion to it should simplify that task. For now, this is the version that is currently being developed on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as possible, and to remove everything related to determining _why_ isolation was broken.
Originally, reporting of isolation breaking was done with a large number of hooks in specific code paths (hardware interrupts, syscalls, IPIs, page faults, etc.), and it was necessary to cover all possible such events to reliably notify a task that its isolation was broken. To avoid such a fragile mechanism, this version relies on the mere fact of the kernel being entered while in isolation mode. As a result, reporting happens later in kernel code, but it covers everything.

This means that there is now no specific reporting, in the kernel log or elsewhere, about the reasons for breaking isolation. Information about that may be valuable at runtime, so a separate mechanism for generically reporting "why did the CPU enter the kernel" (under isolation or other conditions) may be a good thing. That can be done later; at this point it is important that task isolation does not require it, and such a mechanism will not be developed for the limited purpose of supporting isolation alone.

General description

This is the result of development and maintenance of task isolation functionality that originally started based on the task isolation patch v15 and was later updated to include v16. It provided a predictable environment for userspace tasks running on arm64 processors alongside a full-featured Linux environment. It is intended to provide a reliable, interruption-free environment from the point when a userspace task enters isolation until the moment it leaves isolation or receives a signal intentionally sent to it, and it was successfully used for this purpose. While CPU isolation with nohz provides an environment that is close to this requirement, the remaining IPIs and other disturbances keep it from being usable for tasks that require complete predictability of CPU timing.
This set of patches covers only the implementation of task isolation; additional functionality, such as selective TLB flushes, may be implemented to avoid other kinds of disturbances that affect latency and performance of isolated tasks.

The userspace support and test program is now at https://github.com/abelits/libtmc . It was originally developed for an earlier implementation, so it has some checks that may be redundant now but are kept for compatibility.

My thanks to Chris Metcalf for design and maintenance of the original task isolation patch, Francis Giraldeau <francis.giraldeau@gmail.com> and Yuri Norov <ynorov@marvell.com> for various contributions to this work, Frederic Weisbecker <frederic@kernel.org> for his work on CPU isolation and housekeeping that made it possible to remove some less elegant solutions that I had to devise for earlier, <4.17 kernels, and Nitesh Narayan Lal <nitesh@redhat.com> for adapting earlier patches related to interrupt and work distribution in the presence of CPU isolation.

-- Alex
* [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
From: Alex Belits @ 2020-11-23 17:56 UTC
To: nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") the
quiet_vmstat() function became asynchronous, in the sense that the
vmstat work was still scheduled to run on the core when the function
returned. For task isolation, we need a synchronous version of the
function that guarantees that the vmstat worker will not run on the
core on return from the function. Add a quiet_vmstat_sync() function
with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 322dcbfcc933..300ce6648923 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -284,6 +284,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
@@ -391,6 +392,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 698bc0bc18d1..43999caf47a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1936,6 +1936,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1
* Re: [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
From: Thomas Gleixner @ 2020-11-23 21:48 UTC
To: Alex Belits, nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

Alex,

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:

why are you insisting on adding 'task_isolation: ' as prefix to every
single patch? That's wrong as I explained before. The prefix denotes the
affected subsystem, and 'task_isolation' is _NOT_ a subsystem. It's the
project name you are using, but the affected code belongs to the memory
management subsystem, and if you run

  git log mm/vmstat.c

you might get a hint what the proper prefix is, i.e. 'mm/vmstat: '

> In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
> the quiet_vmstat() function became asynchronous, in the sense that
> the vmstat work was still scheduled to run on the core when the
> function returned. For task isolation, we need a synchronous

This changelog is useless, because how should someone not familiar with
the term 'task isolation' figure out what that means? It's not the
reviewer's job to figure that out.

Again: Go read and adhere to Documentation/process/*

Aside of that, your patches are CR/LF inflicted. Please fix your work
flow and tools.

Thanks,

        tglx
* [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
From: Alex Belits @ 2020-11-23 17:56 UTC
To: nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

From: Chris Metcalf <cmetcalf@mellanox.com>

This function checks to see if a vmstat worker is not running, and the
vmstat diffs don't require an update. The function is called from the
task-isolation code to see if we need to actually do some work to quiet
vmstat.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 300ce6648923..24392a957cfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -285,6 +285,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
@@ -393,6 +394,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 43999caf47a4..5b0ad7ed65f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1945,6 +1945,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+	       !need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1
* Re: [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
From: Thomas Gleixner @ 2020-11-23 21:49 UTC
To: Alex Belits, nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

On Mon, Nov 23 2020 at 17:56, Alex Belits wrote:
> This function checks to see if a vmstat worker is not running,
> and the vmstat diffs don't require an update. The function is
> called from the task-isolation code to see if we need to
> actually do some work to quiet vmstat.

A changelog has to explain _WHY_ this change is necessary and not _WHAT_
the patch is doing.

Thanks,

        tglx
* [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
From: Alex Belits @ 2020-11-23 17:56 UTC
To: nitesh, frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt,
    linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, pauld, netdev

The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device-driver-style applications such as high-speed networking code.

This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying

  prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0)

to do so. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel must be booted with an appropriate "isolcpus=nohz,domain,CPULIST" boot argument to enable nohz_full and isolcpus.

The "task_isolation" state is then indicated by setting a new task_struct field, task_isolation_flag, to the value passed by prctl(), and also by setting a TIF_TASK_ISOLATION bit in the thread_info flags. When the kernel is returning to userspace from the prctl() call and sees TIF_TASK_ISOLATION set, it calls the new task_isolation_start() routine to arrange for the task to avoid being interrupted in the future. With interrupts disabled, task_isolation_start() ensures that kernel subsystems that might cause a future interrupt are quiesced. If it doesn't succeed, it adjusts the syscall return value to indicate that fact, and userspace can retry as desired.

In addition to stopping the scheduler tick, the code takes any actions that might avoid a future interrupt to the core, such as quiescing now a worker thread that could otherwise be scheduled later (e.g. the vmstat worker) or performing now a cleanup that would otherwise trigger a future IPI to the core (e.g. the mm lru per-cpu cache). The last stage of enabling task isolation happens in task_isolation_exit_to_user_mode(), which runs last before returning to userspace and changes ll_isol_flags (see below) to prevent other CPUs from interfering with the isolated task.

Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any other exception or irq, the kernel will send it a signal to indicate isolation loss. In addition to sending a signal, the code supports a kernel command-line "task_isolation_debug" flag which causes a stack backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead of the default SIGKILL, to allow for catching the notification signal; for example, in a production environment, it might be helpful to log information to the application logging mechanism before exiting. Or, the signal handler might choose to reset the program counter back to the code segment intended to be run isolated via prctl() to continue execution.

Isolation also disables CPU state synchronization mechanisms that are normally done by IPI. In the future, more synchronization mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from an isolated task -- remote synchronization requests must be re-enabled, and the synchronization procedure triggered, before anything other than low-level kernel entry code is called. The same applies to exiting from the kernel to userspace after isolation is enabled. For this purpose, per-CPU low-level flags ll_isol_flags are used to indicate the isolation state, and task_isolation_kernel_enter() is used to safely clear them early in kernel entry. The CPU mask corresponding to the isolation bit in ll_isol_flags is visible to userspace as /sys/devices/system/cpu/isolation_running, and can be used for monitoring.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt |   6 +
 drivers/base/cpu.c                    |  23 +
 include/linux/hrtimer.h               |   4 +
 include/linux/isolation.h             | 326 ++++++++
 include/linux/sched.h                 |   5 +
 include/linux/tick.h                  |   3 +
 include/uapi/linux/prctl.h            |   6 +
 init/Kconfig                          |  27 +
 kernel/Makefile                       |   2 +
 kernel/isolation.c                    | 714 ++++++++++++++++++
 kernel/signal.c                       |   2 +
 kernel/sys.c                          |   6 +
 kernel/time/hrtimer.c                 |  27 +
 kernel/time/tick-sched.c              |  18 +
 14 files changed, 1169 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44fde25bb221..4b28f2022c98 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5250,6 +5250,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries=	[KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 8f1d6569564c..afa6163203c7 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -284,6 +284,26 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t isolation_running_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	int n;
+	cpumask_var_t isolation_running;
+
+	if (!zalloc_cpumask_var(&isolation_running, GFP_KERNEL))
+		return -ENOMEM;
+
+	task_isolation_cpumask(isolation_running);
+	n = sprintf(buf, "%*pbl\n", cpumask_pr_args(isolation_running));
+
+	free_cpumask_var(isolation_running);
+
+	return n;
+}
+static DEVICE_ATTR_RO(isolation_running);
+#endif
+
 #ifdef CONFIG_NO_HZ_FULL
 static ssize_t print_cpus_nohz_full(struct device *dev,
 				    struct device_attribute *attr, char *buf)
@@ -471,6 +491,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_isolation_running.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 107cedd7019a..9a6c9ce95080 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -529,6 +529,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..eb6c99bd2db4
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,326 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/*
+ * Logging
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)					\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+
+#define BIT_LL_TASK_ISOLATION		(0)
+#define BIT_LL_TASK_ISOLATION_BROKEN	(1)
+#define BIT_LL_TASK_ISOLATION_REQUEST	(2)
+#define FLAG_LL_TASK_ISOLATION		(1 << BIT_LL_TASK_ISOLATION)
+#define FLAG_LL_TASK_ISOLATION_BROKEN	(1 << BIT_LL_TASK_ISOLATION_BROKEN)
+#define FLAG_LL_TASK_ISOLATION_REQUEST	(1 << BIT_LL_TASK_ISOLATION_REQUEST)
+
+DECLARE_PER_CPU(unsigned long, ll_isol_flags);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ *         errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_kernel_enter() - clear low-level task isolation flag
+ *
+ * This should be called immediately after entering kernel. It must
+ * be inline, and suitable for running after leaving isolated
+ * userspace in a "stale" state when synchronization is required
+ * before the CPU can safely enter the rest of the kernel.
+ */
+static __always_inline void task_isolation_kernel_enter(void)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated.
While the + * CPU is running an isolated task, anything that causes an + * interrupt on this CPU must end up calling this function before + * touching the rest of the kernel -- either directly, or through + * fast_task_isolation_cpu_cleanup() or stop_isolation() calling it. + * If any interrupt, including the scheduling timer, arrives, it + * will still end up here early after entering the kernel. From + * this point interrupts are disabled until all CPUs see that this + * CPU is no longer running an isolated task. + * + * See also fast_task_isolation_cpu_cleanup(). + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0) + return; + + raw_local_irq_save(flags); + + /* Change low-level flags to indicate broken isolation */ + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_BROKEN); + + /* + * If something happened that requires a barrier that would + * otherwise be called from remote CPUs by the CPU kick procedure, + * this barrier runs instead of it. After this barrier, the CPU + * kick procedure will see the updated ll_isol_flags, so it + * will run its own IPI to trigger a barrier. + */ + smp_mb(); + /* + * Synchronize instructions -- this CPU was not kicked while + * in isolated mode, so it might require synchronization. + * There might be an IPI if the kick procedure happened and + * ll_isol_flags was already updated while it assembled a CPU + * mask. However, if this did not happen, synchronize everything + * here. + */ + instr_sync(); + raw_local_irq_restore(flags); +} + +/** + * task_isolation_exit_to_user_mode() - set low-level task isolation flag + * if task isolation is requested + * + * This should be called immediately before exiting the kernel. It must + * be inline, and the state of the CPU may become "stale" between setting + * the flag and returning to userspace.
+ */ +static __always_inline void task_isolation_exit_to_user_mode(void) +{ + unsigned long flags; + + /* Check if this task is entering isolation */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_REQUEST) + == 0) + return; + raw_local_irq_save(flags); + + /* Set low-level flags */ + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION); + /* + * After this barrier the rest of the system stops using IPIs + * to synchronize this CPU's state. Since only exit to userspace + * follows, this is safe. Synchronization will happen again in + * task_isolation_kernel_enter() when this CPU enters the kernel. + */ + smp_mb(); + /* + * From this point this CPU is recognized as isolated by + * other CPUs + */ + raw_local_irq_restore(flags); +} + +extern void task_isolation_cpu_cleanup(void); +/** + * task_isolation_start() - attempt to actually start task isolation + * + * This function should be invoked as the last thing prior to returning to + * user space if TIF_TASK_ISOLATION is set in the thread_info flags. It + * will attempt to quiesce the core and enter task-isolation mode. If it + * fails, it will reset the system call return value to an error code that + * indicates the failure mode. + */ +extern void task_isolation_start(void); + +/** + * is_isolation_cpu() - check if CPU is intended for running isolated tasks. + * @cpu: CPU to check. + */ +static inline bool is_isolation_cpu(int cpu) +{ + return task_isolation_map != NULL && + cpumask_test_cpu(cpu, task_isolation_map); +} + +/** + * task_isolation_on_cpu() - check if the cpu is running an isolated task + * @cpu: CPU to check. + */ +static inline int task_isolation_on_cpu(int cpu) +{ + return test_bit(BIT_LL_TASK_ISOLATION, &per_cpu(ll_isol_flags, cpu)); +} + +/** + * task_isolation_cpumask() - set CPUs currently running isolated tasks + * @mask: Mask to modify.
+ */ +extern void task_isolation_cpumask(struct cpumask *mask); + +/** + * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks + * @mask: Mask to modify. + */ +extern void task_isolation_clear_cpumask(struct cpumask *mask); + +/** + * task_isolation_syscall() - report a syscall from an isolated task + * @nr: The syscall number. + * + * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is + * set in the thread_info flags. It checks for valid syscalls, + * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group(). + * For any other syscall it will raise a signal and return failure. + * + * Return: 0 for acceptable syscalls, -1 for all others. + */ +extern int task_isolation_syscall(int nr); + +/** + * task_isolation_before_pending_work_check() - check for isolation breaking + * + * This routine is called from the code responsible for exiting to user mode, + * before the point when thread flags are checked for pending work. + * It must be called if the current task is isolated, because + * TIF_TASK_ISOLATION must trigger a call to it. + */ +void task_isolation_before_pending_work_check(void); + +/** + * _task_isolation_interrupt() - report an interrupt of an isolated task + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This routine should be invoked at any exception or IRQ if + * TIF_TASK_ISOLATION is set in the thread_info flags. It is not necessary + * to invoke it if the exception will generate a signal anyway (e.g. a bad + * page fault), and in that case it is preferable not to invoke it but just + * rely on the standard Linux signal. The macro task_isolation_interrupt() + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code. + */ +extern void _task_isolation_interrupt(const char *fmt, ...); +#define task_isolation_interrupt(fmt, ...)
\ + do { \ + if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \ + _task_isolation_interrupt(fmt, ## __VA_ARGS__); \ + } while (0) + +/** + * task_isolation_remote() - report a remote interrupt of an isolated task + * @cpu: The remote cpu that is about to be interrupted. + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This routine should be invoked any time a remote IPI or other type of + * interrupt is being delivered to another cpu. The function will check to + * see if the target core is running a task-isolation task, and generate a + * diagnostic on the console if so; in addition, we tag the task so it + * doesn't generate another diagnostic when the interrupt actually arrives. + * Generating a diagnostic remotely yields a clearer indication of what + * happened than just reporting only when the remote core is interrupted. + * + */ +extern void task_isolation_remote(int cpu, const char *fmt, ...); + +/** + * task_isolation_remote_cpumask() - report interruption of multiple cpus + * @mask: The set of remote cpus that are about to be interrupted. + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This is the cpumask variant of task_isolation_remote(). We + * generate a single-line diagnostic message even if multiple remote + * task-isolation cpus are being interrupted. + */ +extern void task_isolation_remote_cpumask(const struct cpumask *mask, + const char *fmt, ...); + +/** + * _task_isolation_signal() - disable task isolation when signal is pending + * @task: The task for which to disable isolation. + * + * This function generates a diagnostic and disables task isolation; + * it should be called if TIF_TASK_ISOLATION is set when notifying a + * task of a pending signal.
The task_isolation_interrupt() function + * normally generates a diagnostic for events that just interrupt a + * task without generating a signal; here we need to hook the paths + * that correspond to interrupts that do generate a signal. The macro + * task_isolation_signal() wraps the TIF_TASK_ISOLATION flag test to + * simplify the caller code. + */ +extern void _task_isolation_signal(struct task_struct *task); +#define task_isolation_signal(task) \ + do { \ + if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \ + _task_isolation_signal(task); \ + } while (0) + +#else /* !CONFIG_TASK_ISOLATION */ +static inline int task_isolation_request(unsigned int flags) { return -EINVAL; } +static inline void task_isolation_kernel_enter(void) {} +static inline void task_isolation_exit_to_user_mode(void) {} +static inline void task_isolation_start(void) { } +static inline bool is_isolation_cpu(int cpu) { return 0; } +static inline int task_isolation_on_cpu(int cpu) { return 0; } +static inline void task_isolation_cpumask(struct cpumask *mask) { } +static inline void task_isolation_clear_cpumask(struct cpumask *mask) { } +static inline void task_isolation_cpu_cleanup(void) { } +static inline int task_isolation_syscall(int nr) { return 0; } +static inline void task_isolation_before_pending_work_check(void) { } +static inline void task_isolation_signal(struct task_struct *task) { } +#endif + +#endif /* _LINUX_ISOLATION_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 063cd120b459..5d8b17aa544b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1316,6 +1316,11 @@ struct task_struct { unsigned long prev_lowest_stack; #endif +#ifdef CONFIG_TASK_ISOLATION + unsigned short task_isolation_flags; /* prctl */ + unsigned short task_isolation_state; +#endif + #ifdef CONFIG_X86_MCE void __user *mce_vaddr; __u64 mce_kflags; diff --git a/include/linux/tick.h b/include/linux/tick.h index 7340613c7eff..27c7c033d5a8 100644 --- a/include/linux/tick.h +++ 
b/include/linux/tick.h @@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal, extern void tick_nohz_full_kick_cpu(int cpu); extern void __tick_nohz_task_switch(void); extern void __init tick_nohz_full_setup(cpumask_var_t cpumask); +#ifdef CONFIG_TASK_ISOLATION +extern int try_stop_full_tick(void); +#endif #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 7f0827705c9a..bf72afb0364f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -247,4 +247,10 @@ struct prctl_mm_map { #define PR_SET_IO_FLUSHER 57 #define PR_GET_IO_FLUSHER 58 +/* Enable task_isolation mode for TASK_ISOLATION kernels. */ +#define PR_TASK_ISOLATION 48 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c9446911cf41..f708580fb56d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -648,6 +648,33 @@ config CPU_ISOLATION source "kernel/rcu/Kconfig" +config HAVE_ARCH_TASK_ISOLATION + bool + +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION + help + Allow userspace processes that place themselves on cores with + nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION), + to "isolate" themselves from the kernel. Prior to returning to + userspace, isolated tasks will arrange that no future kernel + activity will interrupt the task while the task is running in + userspace. Attempting to re-enter the kernel while in this mode + will cause the task to be terminated with a signal; you must + explicitly use prctl() to disable task isolation before resuming + normal use of the kernel. 
+ + This "hard" isolation from the kernel is required for userspace + tasks that run hard real-time workloads, such as a high-speed + network driver in userspace. Without this option, but with + NO_HZ_FULL enabled, the kernel will make a good-faith, "soft" + effort to shield a single userspace process from interrupts, but + makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index af601b9bda0e..8a8a361dc1a5 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -133,6 +133,8 @@ KCOV_INSTRUMENT_stackleak.o := n obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o + $(obj)/configs.o: $(obj)/config_data.gz targets += config_data.gz diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..127b2a1cb0cb --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,714 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Implementation of task isolation. + * + * Authors: + * Chris Metcalf <cmetcalf@mellanox.com> + * Alex Belits <abelits@marvell.com> + * Yuri Norov <ynorov@marvell.com> + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/sched.h> +#include <linux/isolation.h> +#include <linux/syscalls.h> +#include <linux/smp.h> +#include <linux/tick.h> +#include <asm/unistd.h> +#include <asm/syscall.h> +#include <linux/hrtimer.h> + +/* + * These values are stored in task_isolation_state. + * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still + * returning from sys_prctl() to userspace. + */ +enum { + STATE_NORMAL = 0, /* Not isolated */ + STATE_ISOLATED = 1 /* In userspace, isolated */ +}; + +/* + * Low-level isolation flags. + * Those flags are used by low-level isolation set/clear/check routines.
+ * Those flags should be set last before return to userspace and cleared + * first upon kernel entry, and synchronized to allow isolation breaking + * detection before touching potentially unsynchronized parts of kernel. + * Isolated task does not receive synchronization events of any kind, so + * at the time of the first entry into kernel it might not be ready to + * run most of the kernel code. However to perform synchronization + * properly, kernel entry code should also enable synchronization events + * at the same time. This presents a problem because more kernel code + * should run to determine the cause of isolation breaking, signals may + * have to be generated, etc. So some flag clearing and synchronization + * should happen in "low-level" entry code but processing of isolation + * breaking should happen in "high-level" code. Low-level isolation flags + * should be set in that low-level code, possibly long before the cause + * of isolation breaking is known. Symmetrically, entering isolation + * should disable synchronization events before returning to userspace + * but after all potentially volatile code is finished. + */ +DEFINE_PER_CPU(unsigned long, ll_isol_flags); + +/* + * Description of the last two tasks that ran isolated on a given CPU. + * This is intended only for messages about isolation breaking. We + * don't want any references to actual task while accessing this from + * CPU that caused isolation breaking -- we know nothing about timing + * and don't want to use locking or RCU. + */ +struct isol_task_desc { + atomic_t curr_index; + atomic_t curr_index_wr; + bool warned[2]; + pid_t pid[2]; + pid_t tgid[2]; + char comm[2][TASK_COMM_LEN]; +}; +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs); + +/* + * Counter for isolation exiting procedures (from request to the start of + * cleanup) being attempted at once on a CPU. 
Normally incrementing of + * this counter is performed from the CPU that caused isolation breaking, + * however decrementing is done from the cleanup procedure, delegated to + * the CPU that is exiting isolation, not from the CPU that caused isolation + * breaking. + * + * If incrementing this counter while starting isolation exit procedure + * results in a value greater than 0, isolation exiting is already in + * progress, and cleanup did not start yet. This means, counter should be + * decremented back, and isolation exit that is already in progress, should + * be allowed to complete. Otherwise, a new isolation exit procedure should + * be started. + */ +DEFINE_PER_CPU(atomic_t, isol_exit_counter); + +/* + * Descriptor for isolation-breaking SMP calls + */ +DEFINE_PER_CPU(call_single_data_t, isol_break_csd); + +cpumask_var_t task_isolation_map; +cpumask_var_t task_isolation_cleanup_map; +static DEFINE_SPINLOCK(task_isolation_cleanup_lock); + +/* We can run on cpus that are isolated from the scheduler and are nohz_full. */ +static int __init task_isolation_init(void) +{ + alloc_bootmem_cpumask_var(&task_isolation_cleanup_map); + if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL)) + /* + * At this point task isolation should match + * nohz_full. This may change in the future. + */ + cpumask_copy(task_isolation_map, tick_nohz_full_mask); + return 0; +} +core_initcall(task_isolation_init) + +/* Enable stack backtraces of any interrupts of task_isolation cores. */ +static bool task_isolation_debug; +static int __init task_isolation_debug_func(char *str) +{ + task_isolation_debug = true; + return 1; +} +__setup("task_isolation_debug", task_isolation_debug_func); + +/* + * Record name, pid and group pid of the task entering isolation on + * the current CPU. 
+ */ +static void record_curr_isolated_task(void) +{ + int ind; + int cpu = smp_processor_id(); + struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu); + struct task_struct *task = current; + + /* Finish everything before recording current task */ + smp_mb(); + ind = atomic_inc_return(&desc->curr_index_wr) & 1; + desc->comm[ind][sizeof(task->comm) - 1] = '\0'; + memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1); + desc->pid[ind] = task->pid; + desc->tgid[ind] = task->tgid; + desc->warned[ind] = false; + /* Write everything, to be seen by other CPUs */ + smp_mb(); + atomic_inc(&desc->curr_index); + /* Everyone will see the new record from this point */ + smp_mb(); +} + +/* + * Print message prefixed with the description of the current (or + * last) isolated task on a given CPU. Intended for isolation breaking + * messages that include target task for the user's convenience. + * + * Messages produced with this function may have obsolete task + * information if isolated tasks managed to exit, start and enter + * isolation multiple times, or multiple tasks tried to enter + * isolation on the same CPU at once. For those unusual cases it would + * contain a valid description of the cause for isolation breaking and + * target CPU number, just not the correct description of which task + * ended up losing isolation. + */ +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...) +{ + struct isol_task_desc *desc; + struct task_struct *task; + va_list args; + char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20]; + char buf[200]; + int curr_cpu, ind_counter, ind_counter_old, ind; + + curr_cpu = get_cpu(); + /* Barrier to synchronize with recording isolated task information */ + smp_rmb(); + desc = &per_cpu(isol_task_descs, cpu); + ind_counter = atomic_read(&desc->curr_index); + + if (curr_cpu == cpu) { + /* + * Message is for the current CPU so current + * task_struct should be used instead of cached + * information. 
+ * + * Like in other diagnostic messages, if issued from + * interrupt context, current will be the interrupted + * task. Unlike other diagnostic messages, this is + * always relevant because the message is about + * interrupting a task. + */ + ind = ind_counter & 1; + if (supp && desc->warned[ind]) { + /* + * If supp is true, skip the message if the + * same task was mentioned in the message + * originated on remote CPU, and it did not + * re-enter isolated state since then (warned + * is true). Only local messages following + * remote messages, likely about the same + * isolation breaking event, are skipped to + * avoid duplication. If remote cause is + * immediately followed by a local one before + * isolation is broken, local cause is skipped + * from messages. + */ + put_cpu(); + return 0; + } + task = current; + snprintf(buf_prefix, sizeof(buf_prefix), + "isolation %s/%d/%d (cpu %d)", + task->comm, task->tgid, task->pid, cpu); + put_cpu(); + } else { + /* + * Message is for remote CPU, use cached information. + */ + put_cpu(); + /* + * Make sure, index remained unchanged while data was + * copied. If it changed, data that was copied may be + * inconsistent because two updates in a sequence could + * overwrite the data while it was being read. 
+ */ + do { + /* Make sure we are reading up to date values */ + smp_mb(); + ind = ind_counter & 1; + snprintf(buf_prefix, sizeof(buf_prefix), + "isolation %s/%d/%d (cpu %d)", + desc->comm[ind], desc->tgid[ind], + desc->pid[ind], cpu); + desc->warned[ind] = true; + ind_counter_old = ind_counter; + /* Record the warned flag, then re-read descriptor */ + smp_mb(); + ind_counter = atomic_read(&desc->curr_index); + /* + * If the counter changed, something was updated, so + * repeat everything to get the current data + */ + } while (ind_counter != ind_counter_old); + } + + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + + switch (level) { + case LOGLEVEL_EMERG: + pr_emerg("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_ALERT: + pr_alert("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_CRIT: + pr_crit("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_ERR: + pr_err("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_WARNING: + pr_warn("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_NOTICE: + pr_notice("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_INFO: + pr_info("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_DEBUG: + pr_debug("%s: %s", buf_prefix, buf); + break; + default: + /* No message without a valid level */ + return 0; + } + return 1; +} + +/* + * Dump stack if need be. This can be helpful even from the final exit + * to usermode code since stack traces sometimes carry information about + * what put you into the kernel, e.g. an interrupt number encoded in + * the initial entry stack frame that is still visible at exit time. + */ +static void debug_dump_stack(void) +{ + if (task_isolation_debug) + dump_stack(); +} + +/* + * Set the flags word but don't try to actually start task isolation yet. + * We will start it when entering user space in task_isolation_start(). 
+ */ +int task_isolation_request(unsigned int flags) +{ + struct task_struct *task = current; + + /* + * The task isolation flags should always be cleared just by + * virtue of having entered the kernel. + */ + WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION)); + WARN_ON_ONCE(task->task_isolation_flags != 0); + WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL); + + task->task_isolation_flags = flags; + if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) + return 0; + + /* We are trying to enable task isolation. */ + set_tsk_thread_flag(task, TIF_TASK_ISOLATION); + + /* + * Shut down the vmstat worker so we're not interrupted later. + * We have to try to do this here (with interrupts enabled) since + * we are canceling delayed work and will call flush_work() + * (which enables interrupts) and possibly schedule(). + */ + quiet_vmstat_sync(); + + /* We return 0 here but we may change that in task_isolation_start(). */ + return 0; +} + +/* + * Perform actions that should be done immediately on exit from isolation. + */ +static void fast_task_isolation_cpu_cleanup(void *info) +{ + unsigned long flags; + + /* + * This function runs on a CPU that ran isolated task. + * It should be called either directly when isolation breaking is + * being processed, or using IPI from another CPU when it intends + * to break isolation on the given CPU. + * + * We don't want this CPU running code from the rest of kernel + * until other CPUs know that it is no longer isolated. Any + * entry into kernel will call task_isolation_kernel_enter() + * before calling this, so this will be already done by + * setting per-cpu flags and synchronizing in that function. + * + * For development purposes it makes sense to check if it was + * done, because there is a possibility that some entry points + * were left unguarded. That would be clearly a bug because it + * will mean that regular kernel code is running with no + * synchronization. 
+ */ + local_irq_save(flags); + atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id())); + /* Barrier to sync with requesting a task isolation breaking */ + smp_mb__after_atomic(); + /* + * At this point breaking isolation will be treated as a + * separate isolation-breaking event, however interrupts won't + * arrive until local_irq_restore() + */ + + /* + * Check for the above mentioned entry without a call to + * task_isolation_kernel_enter() + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) { + /* + * If it did happen, call the function here, to + * prevent further problems from running in + * un-synchronized state + */ + task_isolation_kernel_enter(); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation breaking was not detected on kernel entry\n"); + } + /* + * This task is no longer isolated (and if by any chance this + * is the wrong task, it's already not isolated) + */ + current->task_isolation_flags = 0; + clear_tsk_thread_flag(current, TIF_TASK_ISOLATION); + + /* Run the rest of cleanup later */ + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); + + local_irq_restore(flags); +} + +/* Disable task isolation for the specified task. */ +static void stop_isolation(struct task_struct *p) +{ + int cpu, this_cpu; + unsigned long flags; + + this_cpu = get_cpu(); + cpu = task_cpu(p); + if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) { + /* Already exiting isolation */ + atomic_dec(&per_cpu(isol_exit_counter, cpu)); + put_cpu(); + return; + } + + if (p == current) { + p->task_isolation_state = STATE_NORMAL; + fast_task_isolation_cpu_cleanup(NULL); + task_isolation_cpu_cleanup(); + put_cpu(); + } else { + /* + * Schedule "slow" cleanup. 
This relies on + * TIF_NOTIFY_RESUME being set + */ + spin_lock_irqsave(&task_isolation_cleanup_lock, flags); + cpumask_set_cpu(cpu, task_isolation_cleanup_map); + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); + /* + * Setting flags is delegated to the CPU where + * isolated task is running + * isol_exit_counter will be decremented from there as well. + */ + per_cpu(isol_break_csd, cpu).func = + fast_task_isolation_cpu_cleanup; + per_cpu(isol_break_csd, cpu).info = NULL; + per_cpu(isol_break_csd, cpu).flags = 0; + smp_call_function_single_async(cpu, + &per_cpu(isol_break_csd, cpu)); + put_cpu(); + } +} + +/* + * This code runs with interrupts disabled just before the return to + * userspace, after a prctl() has requested enabling task isolation. + * We take whatever steps are needed to avoid being interrupted later: + * drain the lru pages, stop the scheduler tick, etc. More + * functionality may be added here later to avoid other types of + * interrupts from other kernel subsystems. This, however, may still not + * have the intended result, so the rest of the system takes into account + * the possibility of receiving an interrupt and isolation breaking later. + * + * If we can't enable task isolation, we update the syscall return + * value with an appropriate error. + * + * This, however, does not enable isolation yet, as far as low-level + * flags are concerned. So if interrupts will be enabled, it's still + * possible for the task to be interrupted. The call to + * task_isolation_exit_to_user_mode() should finally enable task + * isolation after this function set FLAG_LL_TASK_ISOLATION_REQUEST. + */ +void task_isolation_start(void) +{ + int error; + unsigned long flags; + + /* + * We should only be called in STATE_NORMAL (isolation + * disabled), on our way out of the kernel from the prctl() + * that turned it on. 
If we are exiting from the kernel in + * another state, it means we made it back into the kernel + * without disabling task isolation, and we should investigate + * how (and in any case disable task isolation at this + * point). We are clearly not on the path back from the + * prctl() so we don't touch the syscall return value. + */ + if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) { + stop_isolation(current); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation start requested while not in the normal state\n"); + return; + } + + /* + * Must be affinitized to a single core with task isolation possible. + * In principle this could be remotely modified between the prctl() + * and the return to userspace, so we have to check it here. + */ + if (current->nr_cpus_allowed != 1 || + !is_isolation_cpu(smp_processor_id())) { + error = -EINVAL; + goto error; + } + + /* If the vmstat delayed work is not canceled, we have to try again. */ + if (!vmstat_idle()) { + error = -EAGAIN; + goto error; + } + + /* Try to stop the dynamic tick. */ + error = try_stop_full_tick(); + if (error) + goto error; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + local_irq_save(flags); + + /* Record isolated task IDs and name */ + record_curr_isolated_task(); + + current->task_isolation_state = STATE_ISOLATED; + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_REQUEST); + /* Barrier to synchronize with reading of flags */ + smp_mb(); + local_irq_restore(flags); + return; + +error: + stop_isolation(current); + syscall_set_return_value(current, current_pt_regs(), error, 0); +} + +/* Stop task isolation on the remote task and send it a signal. 
*/ +static void send_isolation_signal(struct task_struct *task) +{ + int flags = task->task_isolation_flags; + kernel_siginfo_t info = { + .si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL, + }; + + if ((flags & PR_TASK_ISOLATION_ENABLE) == 0) + return; + + stop_isolation(task); + send_sig_info(info.si_signo, &info, task); +} + +/* Only a few syscalls are valid once we are in task isolation mode. */ +static bool is_acceptable_syscall(int syscall) +{ + /* No need to incur an isolation signal if we are just exiting. */ + if (syscall == __NR_exit || syscall == __NR_exit_group) + return true; + + /* Check to see if it's the prctl for isolation. */ + if (syscall == __NR_prctl) { + unsigned long arg[SYSCALL_MAX_ARGS]; + + syscall_get_arguments(current, current_pt_regs(), arg); + if (arg[0] == PR_TASK_ISOLATION) + return true; + } + + return false; +} + +/* + * This routine is called from syscall entry, prevents most syscalls + * from executing, and if needed raises a signal to notify the process. + * + * Note that we have to stop isolation before we even print a message + * here, since otherwise we might end up reporting an interrupt due to + * kicking the printk handling code, rather than reporting the true + * cause of interrupt here. + * + * The message is not suppressed by previous remotely triggered + * messages. 
+ */ +int task_isolation_syscall(int syscall) +{ + struct task_struct *task = current; + + /* + * Check if by any chance syscall is being processed from + * isolated state without a call to + * task_isolation_kernel_enter() happening on entry + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) { + /* + * If it did happen, call the function here, to + * prevent further problems from running in + * un-synchronized state + */ + task_isolation_kernel_enter(); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation breaking was not detected on syscall\n"); + } + /* + * Clear low-level isolation flags to avoid triggering + * a signal on return to userspace + */ + this_cpu_write(ll_isol_flags, 0); + + if (is_acceptable_syscall(syscall)) { + stop_isolation(task); + return 0; + } + + send_isolation_signal(task); + + pr_task_isol_warn(smp_processor_id(), + "task_isolation lost due to syscall %d\n", + syscall); + debug_dump_stack(); + + syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1); + return -1; +} + +/* + * This routine is called from the code responsible for exiting to user mode, + * before the point when thread flags are checked for pending work. + * That function must be called if the current task is isolated, because + * TIF_TASK_ISOLATION must trigger a call to it. 
+ */ +void task_isolation_before_pending_work_check(void) +{ + int cpu; + unsigned long flags; + + /* Handle isolation breaking */ + if ((current->task_isolation_state != STATE_NORMAL) + && ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_BROKEN) + != 0)) { + /* + * Clear low-level isolation flags to avoid triggering + * a signal again + */ + this_cpu_write(ll_isol_flags, 0); + /* Send signal to notify about isolation breaking */ + send_isolation_signal(current); + /* Produce generic message about lost isolation */ + pr_task_isol_warn(smp_processor_id(), "task_isolation lost\n"); + debug_dump_stack(); + } + + /* + * If this CPU is in the map of CPUs with cleanup pending, + * remove it from the map and call cleanup + */ + spin_lock_irqsave(&task_isolation_cleanup_lock, flags); + + cpu = smp_processor_id(); + + if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) { + cpumask_clear_cpu(cpu, task_isolation_cleanup_map); + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); + task_isolation_cpu_cleanup(); + } else + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); +} + +/* + * Called before we wake up a task that has a signal to process. + * Needs to be done to handle interrupts that trigger signals, which + * we don't catch with task_isolation_interrupt() hooks. + * + * This message is also suppressed if there was already a remotely + * caused message about the same isolation breaking event. 
+ */ +void _task_isolation_signal(struct task_struct *task) +{ + struct isol_task_desc *desc; + int ind, cpu; + bool do_warn = (task->task_isolation_state == STATE_ISOLATED); + + cpu = task_cpu(task); + desc = &per_cpu(isol_task_descs, cpu); + ind = atomic_read(&desc->curr_index) & 1; + if (desc->warned[ind]) + do_warn = false; + + stop_isolation(task); + + if (do_warn) { + pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n", + task->comm, task->tgid, task->pid, cpu); + debug_dump_stack(); + } +} + +/* + * Set CPUs currently running isolated tasks in CPU mask. + */ +void task_isolation_cpumask(struct cpumask *mask) +{ + int cpu; + + if (task_isolation_map == NULL) + return; + + /* Barrier to synchronize with writing task isolation flags */ + smp_rmb(); + for_each_cpu(cpu, task_isolation_map) + if (task_isolation_on_cpu(cpu)) + cpumask_set_cpu(cpu, mask); +} + +/* + * Clear CPUs currently running isolated tasks in CPU mask. + */ +void task_isolation_clear_cpumask(struct cpumask *mask) +{ + int cpu; + + if (task_isolation_map == NULL) + return; + + /* Barrier to synchronize with writing task isolation flags */ + smp_rmb(); + for_each_cpu(cpu, task_isolation_map) + if (task_isolation_on_cpu(cpu)) + cpumask_clear_cpu(cpu, mask); +} + +/* + * Cleanup procedure. The call to this procedure may be delayed. 
+ */ +void task_isolation_cpu_cleanup(void) +{ + kick_hrtimer(); +} diff --git a/kernel/signal.c b/kernel/signal.c index ef8f2a28d37c..252886e6aa76 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -46,6 +46,7 @@ #include <linux/livepatch.h> #include <linux/cgroup.h> #include <linux/audit.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -759,6 +760,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { + task_isolation_signal(t); set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/sys.c b/kernel/sys.c index a730c03ee607..0da007ce8666 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -42,6 +42,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/isolation.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER; break; + case PR_TASK_ISOLATION: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = task_isolation_request(arg2); + break; default: error = -EINVAL; break; diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 387b4bef7dd1..27fedddc2729 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -30,6 +30,7 @@ #include <linux/syscalls.h> #include <linux/interrupt.h> #include <linux/tick.h> +#include <linux/isolation.h> #include <linux/err.h> #include <linux/debugobjects.h> #include <linux/sched/signal.h> @@ -720,6 +721,19 @@ static void retrigger_next_event(void *arg) raw_spin_unlock(&base->lock); } +#ifdef CONFIG_TASK_ISOLATION +void kick_hrtimer(void) +{ + unsigned long flags; + + preempt_disable(); + local_irq_save(flags); + retrigger_next_event(NULL); + local_irq_restore(flags); + preempt_enable(); +} +#endif + /* 
* Switch to high resolution mode */ @@ -867,8 +881,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram) void clock_was_set(void) { #ifdef CONFIG_HIGH_RES_TIMERS +#ifdef CONFIG_TASK_ISOLATION + struct cpumask mask; + + cpumask_clear(&mask); + task_isolation_cpumask(&mask); + cpumask_complement(&mask, &mask); + /* + * Retrigger the CPU local events everywhere except CPUs + * running isolated tasks. + */ + on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1); +#else /* Retrigger the CPU local events everywhere */ on_each_cpu(retrigger_next_event, NULL, 1); +#endif #endif timerfd_clock_was_set(); } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 81632cd5e3b7..a213952541db 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -896,6 +896,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts) #endif } +#ifdef CONFIG_TASK_ISOLATION +int try_stop_full_tick(void) +{ + int cpu = smp_processor_id(); + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); + + /* For an unstable clock, we should return a permanent error code. */ + if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE) + return -EINVAL; + + if (!can_stop_full_tick(cpu, ts)) + return -EAGAIN; + + tick_nohz_stop_sched_tick(ts, cpu); + return 0; +} +#endif + static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) { /* -- 2.20.1
* [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel @ 2020-11-23 17:56 ` Alex Belits From: Alex Belits @ 2020-11-23 17:56 UTC To: nitesh, frederic Cc: linux-arch, pauld, catalin.marinas, peterz, Prasun Kapoor, mtosatti, linux-kernel, rostedt, davem, leon, netdev, peterx, trix, tglx, linux-api, will, mingo, linux-arm-kernel The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) to do so. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate "isolcpus=nohz,domain,CPULIST" boot argument to enable nohz_full and isolcpus. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in the thread_info flags. When the kernel is returning to userspace from the prctl() call and sees TIF_TASK_ISOLATION set, it calls the new task_isolation_start() routine to arrange for the task to avoid being interrupted in the future. With interrupts disabled, task_isolation_start() ensures that kernel subsystems that might cause a future interrupt are quiesced. If it doesn't succeed, it adjusts the syscall return value to indicate that fact, and userspace can retry as desired.
In addition to stopping the scheduler tick, the code takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache).

The last stage of enabling task isolation happens in task_isolation_exit_to_user_mode(), which runs last before returning to userspace and changes ll_isol_flags (see below) to prevent other CPUs from interfering with the isolated task.

Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any other exception or irq, the kernel will send it a signal to indicate isolation loss. In addition to sending a signal, the code supports a kernel command-line "task_isolation_debug" flag which causes a stack backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead of the default SIGKILL, to allow for catching the notification signal; for example, in a production environment, it might be helpful to log information to the application logging mechanism before exiting. Or, the signal handler might choose to reset the program counter back to the code segment intended to be run isolated via prctl() to continue execution.

Isolation also disables CPU state synchronization mechanisms that are normally done by IPI. In the future, more synchronization mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from isolated task -- remote synchronization requests must be re-enabled and synchronization procedure triggered, before anything other than low-level kernel entry code is called. Same applies to exiting from kernel to userspace after isolation is enabled. For this purpose, per-CPU low-level flags ll_isol_flags are used to indicate isolation state, and task_isolation_kernel_enter() is used to safely clear them early in kernel entry. CPU mask corresponding to isolation bit in ll_isol_flags is visible to userspace as /sys/devices/system/cpu/isolation_running, and can be used for monitoring. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Alex Belits <abelits@marvell.com> --- .../admin-guide/kernel-parameters.txt | 6 + drivers/base/cpu.c | 23 + include/linux/hrtimer.h | 4 + include/linux/isolation.h | 326 ++++++++ include/linux/sched.h | 5 + include/linux/tick.h | 3 + include/uapi/linux/prctl.h | 6 + init/Kconfig | 27 + kernel/Makefile | 2 + kernel/isolation.c | 714 ++++++++++++++++++ kernel/signal.c | 2 + kernel/sys.c | 6 + kernel/time/hrtimer.c | 27 + kernel/time/tick-sched.c | 18 + 14 files changed, 1169 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644 kernel/isolation.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 44fde25bb221..4b28f2022c98 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5250,6 +5250,12 @@ neutralize any effect of /proc/sys/kernel/sysrq. Useful for debugging. + task_isolation_debug [KNL] + In kernels built with CONFIG_TASK_ISOLATION, this + setting will generate console backtraces to + accompany the diagnostics generated about + interrupting tasks running with task isolation. + tcpmhash_entries= [KNL,NET] Set the number of tcp_metrics_hash slots. 
Default value is 8192 or 16384 depending on total diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 8f1d6569564c..afa6163203c7 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -284,6 +284,26 @@ static ssize_t print_cpus_isolated(struct device *dev, } static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL); +#ifdef CONFIG_TASK_ISOLATION +static ssize_t isolation_running_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int n; + cpumask_var_t isolation_running; + + if (!zalloc_cpumask_var(&isolation_running, GFP_KERNEL)) + return -ENOMEM; + + task_isolation_cpumask(isolation_running); + n = sprintf(buf, "%*pbl\n", cpumask_pr_args(isolation_running)); + + free_cpumask_var(isolation_running); + + return n; +} +static DEVICE_ATTR_RO(isolation_running); +#endif + #ifdef CONFIG_NO_HZ_FULL static ssize_t print_cpus_nohz_full(struct device *dev, struct device_attribute *attr, char *buf) @@ -471,6 +491,9 @@ static struct attribute *cpu_root_attrs[] = { #ifdef CONFIG_NO_HZ_FULL &dev_attr_nohz_full.attr, #endif +#ifdef CONFIG_TASK_ISOLATION + &dev_attr_isolation_running.attr, +#endif #ifdef CONFIG_GENERIC_CPU_AUTOPROBE &dev_attr_modalias.attr, #endif diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h index 107cedd7019a..9a6c9ce95080 100644 --- a/include/linux/hrtimer.h +++ b/include/linux/hrtimer.h @@ -529,6 +529,10 @@ extern void __init hrtimers_init(void); /* Show pending timers: */ extern void sysrq_timer_list_show(void); +#ifdef CONFIG_TASK_ISOLATION +extern void kick_hrtimer(void); +#endif + int hrtimers_prepare_cpu(unsigned int cpu); #ifdef CONFIG_HOTPLUG_CPU int hrtimers_dead_cpu(unsigned int cpu); diff --git a/include/linux/isolation.h b/include/linux/isolation.h new file mode 100644 index 000000000000..eb6c99bd2db4 --- /dev/null +++ b/include/linux/isolation.h @@ -0,0 +1,326 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Task isolation support + * + * Authors: + * Chris Metcalf 
<cmetcalf@mellanox.com> + * Alex Belits <abelits@marvell.com> + * Yuri Norov <ynorov@marvell.com> + */ +#ifndef _LINUX_ISOLATION_H +#define _LINUX_ISOLATION_H + +#include <stdarg.h> +#include <linux/errno.h> +#include <linux/cpumask.h> +#include <linux/percpu.h> +#include <linux/irqflags.h> +#include <linux/prctl.h> +#include <linux/types.h> + +struct task_struct; + +#ifdef CONFIG_TASK_ISOLATION + +/* + * Logging + */ +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...); + +#define pr_task_isol_emerg(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_alert(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_crit(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_err(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_warn(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_notice(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_info(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__) +#define pr_task_isol_debug(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__) + +#define pr_task_isol_emerg_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_alert_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_crit_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_err_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_warn_supp(cpu, fmt, ...) 
\ + task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_notice_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_info_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__) +#define pr_task_isol_debug_supp(cpu, fmt, ...) \ + task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__) + +#define BIT_LL_TASK_ISOLATION (0) +#define BIT_LL_TASK_ISOLATION_BROKEN (1) +#define BIT_LL_TASK_ISOLATION_REQUEST (2) +#define FLAG_LL_TASK_ISOLATION (1 << BIT_LL_TASK_ISOLATION) +#define FLAG_LL_TASK_ISOLATION_BROKEN (1 << BIT_LL_TASK_ISOLATION_BROKEN) +#define FLAG_LL_TASK_ISOLATION_REQUEST (1 << BIT_LL_TASK_ISOLATION_REQUEST) + +DECLARE_PER_CPU(unsigned long, ll_isol_flags); +extern cpumask_var_t task_isolation_map; + +/** + * task_isolation_request() - prctl hook to request task isolation + * @flags: Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx. + * + * This is called from the generic prctl() code for PR_TASK_ISOLATION. + * + * Return: Returns 0 when task isolation enabled, otherwise a negative + * errno. + */ +extern int task_isolation_request(unsigned int flags); + +/** + * task_isolation_kernel_enter() - clear low-level task isolation flag + * + * This should be called immediately after entering kernel. It must + * be inline, and suitable for running after leaving isolated + * userspace in a "stale" state when synchronization is required + * before the CPU can safely enter the rest of the kernel. + */ +static __always_inline void task_isolation_kernel_enter(void) +{ + unsigned long flags; + + /* + * This function runs on a CPU that ran isolated task. + * + * We don't want this CPU running code from the rest of kernel + * until other CPUs know that it is no longer isolated. 
While the + * CPU is running an isolated task, anything that + * causes an interrupt on this CPU must end up calling this + * function before touching the rest of the kernel. That is, it + * must call either this function directly, or + * fast_task_isolation_cpu_cleanup() or stop_isolation(), which + * call it. If any interrupt, including the scheduling timer, + * arrives, it will still end up here early after entering the + * kernel. From this point interrupts are disabled until all + * CPUs see that this CPU is no longer running an isolated + * task. + * + * See also fast_task_isolation_cpu_cleanup(). + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0) + return; + + raw_local_irq_save(flags); + + /* Change low-level flags to indicate broken isolation */ + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_BROKEN); + + /* + * If something happened that requires a barrier that would + * otherwise be called from remote CPUs by the CPU kick procedure, + * this barrier runs instead of it. After this barrier, the CPU + * kick procedure will see the updated ll_isol_flags, so it + * will run its own IPI to trigger a barrier. + */ + smp_mb(); + /* + * Synchronize instructions -- this CPU was not kicked while + * in isolated mode, so it might require synchronization. + * There might be an IPI if the kick procedure happened and + * ll_isol_flags was already updated while it assembled a CPU + * mask. However, if this did not happen, synchronize everything + * here. + */ + instr_sync(); + raw_local_irq_restore(flags); +} + +/** + * task_isolation_exit_to_user_mode() - set low-level task isolation flag + * if task isolation is requested + * + * This should be called immediately before exiting the kernel. It must + * be inline, and the state of the CPU may become "stale" between setting + * the flag and returning to userspace.
+ */ +static __always_inline void task_isolation_exit_to_user_mode(void) +{ + unsigned long flags; + + /* Check if this task is entering isolation */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_REQUEST) + == 0) + return; + raw_local_irq_save(flags); + + /* Set low-level flags */ + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION); + /* + * After this barrier the rest of the system stops using IPIs + * to synchronize this CPU state. Since only exit to userspace + * follows, this is safe. Synchronization will happen again in + * task_isolation_kernel_enter() when this CPU enters the kernel. + */ + smp_mb(); + /* + * From this point this CPU is recognized as isolated by + * other CPUs + */ + raw_local_irq_restore(flags); +} + +extern void task_isolation_cpu_cleanup(void); +/** + * task_isolation_start() - attempt to actually start task isolation + * + * This function should be invoked as the last thing prior to returning to + * user space if TIF_TASK_ISOLATION is set in the thread_info flags. It + * will attempt to quiesce the core and enter task-isolation mode. If it + * fails, it will reset the system call return value to an error code that + * indicates the failure mode. + */ +extern void task_isolation_start(void); + +/** + * is_isolation_cpu() - check if CPU is intended for running isolated tasks. + * @cpu: CPU to check. + */ +static inline bool is_isolation_cpu(int cpu) +{ + return task_isolation_map != NULL && + cpumask_test_cpu(cpu, task_isolation_map); +} + +/** + * task_isolation_on_cpu() - check if the cpu is running an isolated task + * @cpu: CPU to check. + */ +static inline int task_isolation_on_cpu(int cpu) +{ + return test_bit(BIT_LL_TASK_ISOLATION, &per_cpu(ll_isol_flags, cpu)); +} + +/** + * task_isolation_cpumask() - set CPUs currently running isolated tasks + * @mask: Mask to modify.
+ */ +extern void task_isolation_cpumask(struct cpumask *mask); + +/** + * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks + * @mask: Mask to modify. + */ +extern void task_isolation_clear_cpumask(struct cpumask *mask); + +/** + * task_isolation_syscall() - report a syscall from an isolated task + * @nr: The syscall number. + * + * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is + * set in the thread_info flags. It checks for valid syscalls, + * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group(). + * For any other syscall it will raise a signal and return failure. + * + * Return: 0 for acceptable syscalls, -1 for all others. + */ +extern int task_isolation_syscall(int nr); + +/** + * task_isolation_before_pending_work_check() - check for isolation breaking + * + * This routine is called from the code responsible for exiting to user mode, + * before the point when thread flags are checked for pending work. + * That function must be called if the current task is isolated, because + * TIF_TASK_ISOLATION must trigger a call to it. + */ +void task_isolation_before_pending_work_check(void); + +/** + * _task_isolation_interrupt() - report an interrupt of an isolated task + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This routine should be invoked at any exception or IRQ if + * TIF_TASK_ISOLATION is set in the thread_info flags. It is not necessary + * to invoke it if the exception will generate a signal anyway (e.g. a bad + * page fault), and in that case it is preferable not to invoke it but just + * rely on the standard Linux signal. The macro task_isolation_syscall() + * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code. + */ +extern void _task_isolation_interrupt(const char *fmt, ...); +#define task_isolation_interrupt(fmt, ...) 
\ + do { \ + if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \ + _task_isolation_interrupt(fmt, ## __VA_ARGS__); \ + } while (0) + +/** + * task_isolation_remote() - report a remote interrupt of an isolated task + * @cpu: The remote cpu that is about to be interrupted. + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This routine should be invoked any time a remote IPI or other type of + * interrupt is being delivered to another cpu. The function will check to + * see if the target core is running a task-isolation task, and generate a + * diagnostic on the console if so; in addition, we tag the task so it + * doesn't generate another diagnostic when the interrupt actually arrives. + * Generating a diagnostic remotely yields a clearer indication of what + * happened than just reporting only when the remote core is interrupted. + * + */ +extern void task_isolation_remote(int cpu, const char *fmt, ...); + +/** + * task_isolation_remote_cpumask() - report interruption of multiple cpus + * @mask: The set of remote cpus that are about to be interrupted. + * @fmt: A format string describing the interrupt + * @...: Format arguments, if any. + * + * This is the cpumask variant of task_isolation_remote(). We + * generate a single-line diagnostic message even if multiple remote + * task-isolation cpus are being interrupted. + */ +extern void task_isolation_remote_cpumask(const struct cpumask *mask, + const char *fmt, ...); + +/** + * _task_isolation_signal() - disable task isolation when signal is pending + * @task: The task for which to disable isolation. + * + * This function generates a diagnostic and disables task isolation; + * it should be called if TIF_TASK_ISOLATION is set when notifying a + * task of a pending signal.
The task_isolation_interrupt() function + * normally generates a diagnostic for events that just interrupt a + * task without generating a signal; here we need to hook the paths + * that correspond to interrupts that do generate a signal. The macro + * task_isolation_signal() wraps the TIF_TASK_ISOLATION flag test to + * simplify the caller code. + */ +extern void _task_isolation_signal(struct task_struct *task); +#define task_isolation_signal(task) \ + do { \ + if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \ + _task_isolation_signal(task); \ + } while (0) + +#else /* !CONFIG_TASK_ISOLATION */ +static inline int task_isolation_request(unsigned int flags) { return -EINVAL; } +static inline void task_isolation_kernel_enter(void) {} +static inline void task_isolation_exit_to_user_mode(void) {} +static inline void task_isolation_start(void) { } +static inline bool is_isolation_cpu(int cpu) { return 0; } +static inline int task_isolation_on_cpu(int cpu) { return 0; } +static inline void task_isolation_cpumask(struct cpumask *mask) { } +static inline void task_isolation_clear_cpumask(struct cpumask *mask) { } +static inline void task_isolation_cpu_cleanup(void) { } +static inline int task_isolation_syscall(int nr) { return 0; } +static inline void task_isolation_before_pending_work_check(void) { } +static inline void task_isolation_signal(struct task_struct *task) { } +#endif + +#endif /* _LINUX_ISOLATION_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 063cd120b459..5d8b17aa544b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1316,6 +1316,11 @@ struct task_struct { unsigned long prev_lowest_stack; #endif +#ifdef CONFIG_TASK_ISOLATION + unsigned short task_isolation_flags; /* prctl */ + unsigned short task_isolation_state; +#endif + #ifdef CONFIG_X86_MCE void __user *mce_vaddr; __u64 mce_kflags; diff --git a/include/linux/tick.h b/include/linux/tick.h index 7340613c7eff..27c7c033d5a8 100644 --- a/include/linux/tick.h +++ 
b/include/linux/tick.h @@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal, extern void tick_nohz_full_kick_cpu(int cpu); extern void __tick_nohz_task_switch(void); extern void __init tick_nohz_full_setup(cpumask_var_t cpumask); +#ifdef CONFIG_TASK_ISOLATION +extern int try_stop_full_tick(void); +#endif #else static inline bool tick_nohz_full_enabled(void) { return false; } static inline bool tick_nohz_full_cpu(int cpu) { return false; } diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 7f0827705c9a..bf72afb0364f 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -247,4 +247,10 @@ struct prctl_mm_map { #define PR_SET_IO_FLUSHER 57 #define PR_GET_IO_FLUSHER 58 +/* Enable task_isolation mode for TASK_ISOLATION kernels. */ +#define PR_TASK_ISOLATION 48 +# define PR_TASK_ISOLATION_ENABLE (1 << 0) +# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8) +# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f) + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index c9446911cf41..f708580fb56d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -648,6 +648,33 @@ config CPU_ISOLATION source "kernel/rcu/Kconfig" +config HAVE_ARCH_TASK_ISOLATION + bool + +config TASK_ISOLATION + bool "Provide hard CPU isolation from the kernel on demand" + depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION + help + Allow userspace processes that place themselves on cores with + nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION), + to "isolate" themselves from the kernel. Prior to returning to + userspace, isolated tasks will arrange that no future kernel + activity will interrupt the task while the task is running in + userspace. Attempting to re-enter the kernel while in this mode + will cause the task to be terminated with a signal; you must + explicitly use prctl() to disable task isolation before resuming + normal use of the kernel. 
+ + This "hard" isolation from the kernel is required for userspace + tasks that run hard real-time code, such as + a high-speed network driver in userspace. Without this option, but + with NO_HZ_FULL enabled, the kernel will make a good-faith, "soft" + effort to shield a single userspace process from interrupts, but + makes no guarantees. + + You should say "N" unless you are intending to run a + high-performance userspace driver or similar task. + config BUILD_BIN2C bool default n diff --git a/kernel/Makefile b/kernel/Makefile index af601b9bda0e..8a8a361dc1a5 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -133,6 +133,8 @@ KCOV_INSTRUMENT_stackleak.o := n obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o +obj-$(CONFIG_TASK_ISOLATION) += isolation.o + $(obj)/configs.o: $(obj)/config_data.gz targets += config_data.gz diff --git a/kernel/isolation.c b/kernel/isolation.c new file mode 100644 index 000000000000..127b2a1cb0cb --- /dev/null +++ b/kernel/isolation.c @@ -0,0 +1,714 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Implementation of task isolation. + * + * Authors: + * Chris Metcalf <cmetcalf@mellanox.com> + * Alex Belits <abelits@marvell.com> + * Yuri Norov <ynorov@marvell.com> + */ + +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/vmstat.h> +#include <linux/sched.h> +#include <linux/isolation.h> +#include <linux/syscalls.h> +#include <linux/smp.h> +#include <linux/tick.h> +#include <asm/unistd.h> +#include <asm/syscall.h> +#include <linux/hrtimer.h> + +/* + * These values are stored in task_isolation_state. + * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still + * returning from sys_prctl() to userspace. + */ +enum { + STATE_NORMAL = 0, /* Not isolated */ + STATE_ISOLATED = 1 /* In userspace, isolated */ +}; + +/* + * Low-level isolation flags. + * Those flags are used by low-level isolation set/clear/check routines.
+ * Those flags should be set last before return to userspace and cleared + * first upon kernel entry, and synchronized to allow isolation-breaking + * detection before touching potentially unsynchronized parts of the kernel. + * An isolated task does not receive synchronization events of any kind, so + * at the time of its first entry into the kernel it might not be ready to + * run most of the kernel code. However, to perform synchronization + * properly, kernel entry code should also enable synchronization events + * at the same time. This presents a problem because more kernel code + * must run to determine the cause of isolation breaking, signals may + * have to be generated, etc. So some flag clearing and synchronization + * should happen in "low-level" entry code, but processing of isolation + * breaking should happen in "high-level" code. Low-level isolation flags + * should be set in that low-level code, possibly long before the cause + * of isolation breaking is known. Symmetrically, entering isolation + * should disable synchronization events before returning to userspace, + * but only after all potentially volatile code has finished. + */ +DEFINE_PER_CPU(unsigned long, ll_isol_flags); + +/* + * Description of the last two tasks that ran isolated on a given CPU. + * This is intended only for messages about isolation breaking. We + * don't want any references to the actual task while accessing this from + * the CPU that caused isolation breaking -- we know nothing about timing + * and don't want to use locking or RCU. + */ +struct isol_task_desc { + atomic_t curr_index; + atomic_t curr_index_wr; + bool warned[2]; + pid_t pid[2]; + pid_t tgid[2]; + char comm[2][TASK_COMM_LEN]; +}; +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs); + +/* + * Counter for isolation exiting procedures (from request to the start of + * cleanup) being attempted at once on a CPU.
Normally this counter is + * incremented from the CPU that caused isolation breaking; decrementing, + * however, is done from the cleanup procedure, which is delegated to the + * CPU that is exiting isolation, not the CPU that caused the breaking. + * + * If incrementing this counter while starting the isolation exit procedure + * results in a value greater than 1, an isolation exit is already in + * progress and its cleanup has not started yet. In that case the counter + * should be decremented back, and the isolation exit already in progress + * should be allowed to complete. Otherwise, a new isolation exit procedure + * should be started. + */ +DEFINE_PER_CPU(atomic_t, isol_exit_counter); + +/* + * Descriptor for isolation-breaking SMP calls + */ +DEFINE_PER_CPU(call_single_data_t, isol_break_csd); + +cpumask_var_t task_isolation_map; +cpumask_var_t task_isolation_cleanup_map; +static DEFINE_SPINLOCK(task_isolation_cleanup_lock); + +/* We can run on cpus that are isolated from the scheduler and are nohz_full. */ +static int __init task_isolation_init(void) +{ + alloc_bootmem_cpumask_var(&task_isolation_cleanup_map); + if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL)) + /* + * At this point task isolation should match + * nohz_full. This may change in the future. + */ + cpumask_copy(task_isolation_map, tick_nohz_full_mask); + return 0; +} +core_initcall(task_isolation_init) + +/* Enable stack backtraces of any interrupts of task_isolation cores. */ +static bool task_isolation_debug; +static int __init task_isolation_debug_func(char *str) +{ + task_isolation_debug = true; + return 1; +} +__setup("task_isolation_debug", task_isolation_debug_func); + +/* + * Record name, pid and group pid of the task entering isolation on + * the current CPU.
+ */ +static void record_curr_isolated_task(void) +{ + int ind; + int cpu = smp_processor_id(); + struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu); + struct task_struct *task = current; + + /* Finish everything before recording current task */ + smp_mb(); + ind = atomic_inc_return(&desc->curr_index_wr) & 1; + desc->comm[ind][sizeof(task->comm) - 1] = '\0'; + memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1); + desc->pid[ind] = task->pid; + desc->tgid[ind] = task->tgid; + desc->warned[ind] = false; + /* Write everything, to be seen by other CPUs */ + smp_mb(); + atomic_inc(&desc->curr_index); + /* Everyone will see the new record from this point */ + smp_mb(); +} + +/* + * Print message prefixed with the description of the current (or + * last) isolated task on a given CPU. Intended for isolation breaking + * messages that include target task for the user's convenience. + * + * Messages produced with this function may have obsolete task + * information if isolated tasks managed to exit, start and enter + * isolation multiple times, or multiple tasks tried to enter + * isolation on the same CPU at once. For those unusual cases it would + * contain a valid description of the cause for isolation breaking and + * target CPU number, just not the correct description of which task + * ended up losing isolation. + */ +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...) +{ + struct isol_task_desc *desc; + struct task_struct *task; + va_list args; + char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20]; + char buf[200]; + int curr_cpu, ind_counter, ind_counter_old, ind; + + curr_cpu = get_cpu(); + /* Barrier to synchronize with recording isolated task information */ + smp_rmb(); + desc = &per_cpu(isol_task_descs, cpu); + ind_counter = atomic_read(&desc->curr_index); + + if (curr_cpu == cpu) { + /* + * Message is for the current CPU so current + * task_struct should be used instead of cached + * information. 
+ * + * Like in other diagnostic messages, if issued from + * interrupt context, current will be the interrupted + * task. Unlike other diagnostic messages, this is + * always relevant because the message is about + * interrupting a task. + */ + ind = ind_counter & 1; + if (supp && desc->warned[ind]) { + /* + * If supp is true, skip the message if the + * same task was mentioned in a message that + * originated on a remote CPU, and it did not + * re-enter the isolated state since then + * (warned is true). Only local messages + * following remote messages, likely about the + * same isolation-breaking event, are skipped + * to avoid duplication. If a remote cause is + * immediately followed by a local one before + * isolation is broken, the local cause is + * omitted from the messages. + */ + put_cpu(); + return 0; + } + task = current; + snprintf(buf_prefix, sizeof(buf_prefix), + "isolation %s/%d/%d (cpu %d)", + task->comm, task->tgid, task->pid, cpu); + put_cpu(); + } else { + /* + * The message is for a remote CPU, so use cached information. + */ + put_cpu(); + /* + * Make sure the index remained unchanged while the data + * was copied. If it changed, the copied data may be + * inconsistent, because two updates in sequence could + * have overwritten the data while it was being read.
+ */ + do { + /* Make sure we are reading up to date values */ + smp_mb(); + ind = ind_counter & 1; + snprintf(buf_prefix, sizeof(buf_prefix), + "isolation %s/%d/%d (cpu %d)", + desc->comm[ind], desc->tgid[ind], + desc->pid[ind], cpu); + desc->warned[ind] = true; + ind_counter_old = ind_counter; + /* Record the warned flag, then re-read descriptor */ + smp_mb(); + ind_counter = atomic_read(&desc->curr_index); + /* + * If the counter changed, something was updated, so + * repeat everything to get the current data + */ + } while (ind_counter != ind_counter_old); + } + + va_start(args, fmt); + vsnprintf(buf, sizeof(buf), fmt, args); + va_end(args); + + switch (level) { + case LOGLEVEL_EMERG: + pr_emerg("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_ALERT: + pr_alert("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_CRIT: + pr_crit("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_ERR: + pr_err("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_WARNING: + pr_warn("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_NOTICE: + pr_notice("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_INFO: + pr_info("%s: %s", buf_prefix, buf); + break; + case LOGLEVEL_DEBUG: + pr_debug("%s: %s", buf_prefix, buf); + break; + default: + /* No message without a valid level */ + return 0; + } + return 1; +} + +/* + * Dump stack if need be. This can be helpful even from the final exit + * to usermode code since stack traces sometimes carry information about + * what put you into the kernel, e.g. an interrupt number encoded in + * the initial entry stack frame that is still visible at exit time. + */ +static void debug_dump_stack(void) +{ + if (task_isolation_debug) + dump_stack(); +} + +/* + * Set the flags word but don't try to actually start task isolation yet. + * We will start it when entering user space in task_isolation_start(). 
+ */ +int task_isolation_request(unsigned int flags) +{ + struct task_struct *task = current; + + /* + * The task isolation flags should always be cleared just by + * virtue of having entered the kernel. + */ + WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION)); + WARN_ON_ONCE(task->task_isolation_flags != 0); + WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL); + + task->task_isolation_flags = flags; + if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE)) + return 0; + + /* We are trying to enable task isolation. */ + set_tsk_thread_flag(task, TIF_TASK_ISOLATION); + + /* + * Shut down the vmstat worker so we're not interrupted later. + * We have to try to do this here (with interrupts enabled) since + * we are canceling delayed work and will call flush_work() + * (which enables interrupts) and possibly schedule(). + */ + quiet_vmstat_sync(); + + /* We return 0 here but we may change that in task_isolation_start(). */ + return 0; +} + +/* + * Perform actions that should be done immediately on exit from isolation. + */ +static void fast_task_isolation_cpu_cleanup(void *info) +{ + unsigned long flags; + + /* + * This function runs on a CPU that ran isolated task. + * It should be called either directly when isolation breaking is + * being processed, or using IPI from another CPU when it intends + * to break isolation on the given CPU. + * + * We don't want this CPU running code from the rest of kernel + * until other CPUs know that it is no longer isolated. Any + * entry into kernel will call task_isolation_kernel_enter() + * before calling this, so this will be already done by + * setting per-cpu flags and synchronizing in that function. + * + * For development purposes it makes sense to check if it was + * done, because there is a possibility that some entry points + * were left unguarded. That would be clearly a bug because it + * will mean that regular kernel code is running with no + * synchronization. 
+ */ + local_irq_save(flags); + atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id())); + /* Barrier to sync with requesting a task isolation breaking */ + smp_mb__after_atomic(); + /* + * At this point breaking isolation will be treated as a + * separate isolation-breaking event, however interrupts won't + * arrive until local_irq_restore() + */ + + /* + * Check for the above mentioned entry without a call to + * task_isolation_kernel_enter() + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) { + /* + * If it did happen, call the function here, to + * prevent further problems from running in + * un-synchronized state + */ + task_isolation_kernel_enter(); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation breaking was not detected on kernel entry\n"); + } + /* + * This task is no longer isolated (and if by any chance this + * is the wrong task, it's already not isolated) + */ + current->task_isolation_flags = 0; + clear_tsk_thread_flag(current, TIF_TASK_ISOLATION); + + /* Run the rest of cleanup later */ + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); + + local_irq_restore(flags); +} + +/* Disable task isolation for the specified task. */ +static void stop_isolation(struct task_struct *p) +{ + int cpu, this_cpu; + unsigned long flags; + + this_cpu = get_cpu(); + cpu = task_cpu(p); + if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) { + /* Already exiting isolation */ + atomic_dec(&per_cpu(isol_exit_counter, cpu)); + put_cpu(); + return; + } + + if (p == current) { + p->task_isolation_state = STATE_NORMAL; + fast_task_isolation_cpu_cleanup(NULL); + task_isolation_cpu_cleanup(); + put_cpu(); + } else { + /* + * Schedule "slow" cleanup. 
This relies on + * TIF_NOTIFY_RESUME being set + */ + spin_lock_irqsave(&task_isolation_cleanup_lock, flags); + cpumask_set_cpu(cpu, task_isolation_cleanup_map); + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); + /* + * Setting flags is delegated to the CPU where + * isolated task is running + * isol_exit_counter will be decremented from there as well. + */ + per_cpu(isol_break_csd, cpu).func = + fast_task_isolation_cpu_cleanup; + per_cpu(isol_break_csd, cpu).info = NULL; + per_cpu(isol_break_csd, cpu).flags = 0; + smp_call_function_single_async(cpu, + &per_cpu(isol_break_csd, cpu)); + put_cpu(); + } +} + +/* + * This code runs with interrupts disabled just before the return to + * userspace, after a prctl() has requested enabling task isolation. + * We take whatever steps are needed to avoid being interrupted later: + * drain the lru pages, stop the scheduler tick, etc. More + * functionality may be added here later to avoid other types of + * interrupts from other kernel subsystems. This, however, may still not + * have the intended result, so the rest of the system takes into account + * the possibility of receiving an interrupt and isolation breaking later. + * + * If we can't enable task isolation, we update the syscall return + * value with an appropriate error. + * + * This, however, does not enable isolation yet, as far as low-level + * flags are concerned. So if interrupts will be enabled, it's still + * possible for the task to be interrupted. The call to + * task_isolation_exit_to_user_mode() should finally enable task + * isolation after this function set FLAG_LL_TASK_ISOLATION_REQUEST. + */ +void task_isolation_start(void) +{ + int error; + unsigned long flags; + + /* + * We should only be called in STATE_NORMAL (isolation + * disabled), on our way out of the kernel from the prctl() + * that turned it on. 
If we are exiting from the kernel in + * another state, it means we made it back into the kernel + * without disabling task isolation, and we should investigate + * how (and in any case disable task isolation at this + * point). We are clearly not on the path back from the + * prctl() so we don't touch the syscall return value. + */ + if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) { + stop_isolation(current); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation start requested while not in the normal state\n"); + return; + } + + /* + * Must be affinitized to a single core with task isolation possible. + * In principle this could be remotely modified between the prctl() + * and the return to userspace, so we have to check it here. + */ + if (current->nr_cpus_allowed != 1 || + !is_isolation_cpu(smp_processor_id())) { + error = -EINVAL; + goto error; + } + + /* If the vmstat delayed work is not canceled, we have to try again. */ + if (!vmstat_idle()) { + error = -EAGAIN; + goto error; + } + + /* Try to stop the dynamic tick. */ + error = try_stop_full_tick(); + if (error) + goto error; + + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */ + lru_add_drain(); + + local_irq_save(flags); + + /* Record isolated task IDs and name */ + record_curr_isolated_task(); + + current->task_isolation_state = STATE_ISOLATED; + this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_REQUEST); + /* Barrier to synchronize with reading of flags */ + smp_mb(); + local_irq_restore(flags); + return; + +error: + stop_isolation(current); + syscall_set_return_value(current, current_pt_regs(), error, 0); +} + +/* Stop task isolation on the remote task and send it a signal. 
*/ +static void send_isolation_signal(struct task_struct *task) +{ + int flags = task->task_isolation_flags; + kernel_siginfo_t info = { + .si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL, + }; + + if ((flags & PR_TASK_ISOLATION_ENABLE) == 0) + return; + + stop_isolation(task); + send_sig_info(info.si_signo, &info, task); +} + +/* Only a few syscalls are valid once we are in task isolation mode. */ +static bool is_acceptable_syscall(int syscall) +{ + /* No need to incur an isolation signal if we are just exiting. */ + if (syscall == __NR_exit || syscall == __NR_exit_group) + return true; + + /* Check to see if it's the prctl for isolation. */ + if (syscall == __NR_prctl) { + unsigned long arg[SYSCALL_MAX_ARGS]; + + syscall_get_arguments(current, current_pt_regs(), arg); + if (arg[0] == PR_TASK_ISOLATION) + return true; + } + + return false; +} + +/* + * This routine is called from syscall entry, prevents most syscalls + * from executing, and if needed raises a signal to notify the process. + * + * Note that we have to stop isolation before we even print a message + * here, since otherwise we might end up reporting an interrupt due to + * kicking the printk handling code, rather than reporting the true + * cause of interrupt here. + * + * The message is not suppressed by previous remotely triggered + * messages. 
+ */ +int task_isolation_syscall(int syscall) +{ + struct task_struct *task = current; + + /* + * Check if by any chance syscall is being processed from + * isolated state without a call to + * task_isolation_kernel_enter() happening on entry + */ + if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) { + /* + * If it did happen, call the function here, to + * prevent further problems from running in + * un-synchronized state + */ + task_isolation_kernel_enter(); + /* Report the problem */ + pr_task_isol_emerg(smp_processor_id(), + "Isolation breaking was not detected on syscall\n"); + } + /* + * Clear low-level isolation flags to avoid triggering + * a signal on return to userspace + */ + this_cpu_write(ll_isol_flags, 0); + + if (is_acceptable_syscall(syscall)) { + stop_isolation(task); + return 0; + } + + send_isolation_signal(task); + + pr_task_isol_warn(smp_processor_id(), + "task_isolation lost due to syscall %d\n", + syscall); + debug_dump_stack(); + + syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1); + return -1; +} + +/* + * This routine is called from the code responsible for exiting to user mode, + * before the point when thread flags are checked for pending work. + * That function must be called if the current task is isolated, because + * TIF_TASK_ISOLATION must trigger a call to it. 
+ */ +void task_isolation_before_pending_work_check(void) +{ + int cpu; + unsigned long flags; + + /* Handle isolation breaking */ + if ((current->task_isolation_state != STATE_NORMAL) + && ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_BROKEN) + != 0)) { + /* + * Clear low-level isolation flags to avoid triggering + * a signal again + */ + this_cpu_write(ll_isol_flags, 0); + /* Send signal to notify about isolation breaking */ + send_isolation_signal(current); + /* Produce generic message about lost isolation */ + pr_task_isol_warn(smp_processor_id(), "task_isolation lost\n"); + debug_dump_stack(); + } + + /* + * If this CPU is in the map of CPUs with cleanup pending, + * remove it from the map and call cleanup + */ + spin_lock_irqsave(&task_isolation_cleanup_lock, flags); + + cpu = smp_processor_id(); + + if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) { + cpumask_clear_cpu(cpu, task_isolation_cleanup_map); + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); + task_isolation_cpu_cleanup(); + } else + spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags); +} + +/* + * Called before we wake up a task that has a signal to process. + * Needs to be done to handle interrupts that trigger signals, which + * we don't catch with task_isolation_interrupt() hooks. + * + * This message is also suppressed if there was already a remotely + * caused message about the same isolation breaking event. 
+ */ +void _task_isolation_signal(struct task_struct *task) +{ + struct isol_task_desc *desc; + int ind, cpu; + bool do_warn = (task->task_isolation_state == STATE_ISOLATED); + + cpu = task_cpu(task); + desc = &per_cpu(isol_task_descs, cpu); + ind = atomic_read(&desc->curr_index) & 1; + if (desc->warned[ind]) + do_warn = false; + + stop_isolation(task); + + if (do_warn) { + pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n", + task->comm, task->tgid, task->pid, cpu); + debug_dump_stack(); + } +} + +/* + * Set CPUs currently running isolated tasks in CPU mask. + */ +void task_isolation_cpumask(struct cpumask *mask) +{ + int cpu; + + if (task_isolation_map == NULL) + return; + + /* Barrier to synchronize with writing task isolation flags */ + smp_rmb(); + for_each_cpu(cpu, task_isolation_map) + if (task_isolation_on_cpu(cpu)) + cpumask_set_cpu(cpu, mask); +} + +/* + * Clear CPUs currently running isolated tasks in CPU mask. + */ +void task_isolation_clear_cpumask(struct cpumask *mask) +{ + int cpu; + + if (task_isolation_map == NULL) + return; + + /* Barrier to synchronize with writing task isolation flags */ + smp_rmb(); + for_each_cpu(cpu, task_isolation_map) + if (task_isolation_on_cpu(cpu)) + cpumask_clear_cpu(cpu, mask); +} + +/* + * Cleanup procedure. The call to this procedure may be delayed. 
+ */ +void task_isolation_cpu_cleanup(void) +{ + kick_hrtimer(); +} diff --git a/kernel/signal.c b/kernel/signal.c index ef8f2a28d37c..252886e6aa76 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -46,6 +46,7 @@ #include <linux/livepatch.h> #include <linux/cgroup.h> #include <linux/audit.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -759,6 +760,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { + task_isolation_signal(t); set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable diff --git a/kernel/sys.c b/kernel/sys.c index a730c03ee607..0da007ce8666 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -42,6 +42,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/isolation.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER; break; + case PR_TASK_ISOLATION: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = task_isolation_request(arg2); + break; default: error = -EINVAL; break; diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 387b4bef7dd1..27fedddc2729 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -30,6 +30,7 @@ #include <linux/syscalls.h> #include <linux/interrupt.h> #include <linux/tick.h> +#include <linux/isolation.h> #include <linux/err.h> #include <linux/debugobjects.h> #include <linux/sched/signal.h> @@ -720,6 +721,19 @@ static void retrigger_next_event(void *arg) raw_spin_unlock(&base->lock); } +#ifdef CONFIG_TASK_ISOLATION +void kick_hrtimer(void) +{ + unsigned long flags; + + preempt_disable(); + local_irq_save(flags); + retrigger_next_event(NULL); + local_irq_restore(flags); + preempt_enable(); +} +#endif + /* 
* Switch to high resolution mode */ @@ -867,8 +881,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram) void clock_was_set(void) { #ifdef CONFIG_HIGH_RES_TIMERS +#ifdef CONFIG_TASK_ISOLATION + struct cpumask mask; + + cpumask_clear(&mask); + task_isolation_cpumask(&mask); + cpumask_complement(&mask, &mask); + /* + * Retrigger the CPU local events everywhere except CPUs + * running isolated tasks. + */ + on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1); +#else /* Retrigger the CPU local events everywhere */ on_each_cpu(retrigger_next_event, NULL, 1); +#endif #endif timerfd_clock_was_set(); } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 81632cd5e3b7..a213952541db 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -896,6 +896,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts) #endif } +#ifdef CONFIG_TASK_ISOLATION +int try_stop_full_tick(void) +{ + int cpu = smp_processor_id(); + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); + + /* For an unstable clock, we should return a permanent error code. */ + if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE) + return -EINVAL; + + if (!can_stop_full_tick(cpu, ts)) + return -EAGAIN; + + tick_nohz_stop_sched_tick(ts, cpu); + return 0; +} +#endif + static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) { /* -- 2.20.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel 2020-11-23 17:56 ` Alex Belits @ 2020-11-23 22:01 ` Thomas Gleixner -1 siblings, 0 replies; 74+ messages in thread From: Thomas Gleixner @ 2020-11-23 22:01 UTC (permalink / raw) To: Alex Belits, nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Alex, On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: > .../admin-guide/kernel-parameters.txt | 6 + > drivers/base/cpu.c | 23 + > include/linux/hrtimer.h | 4 + > include/linux/isolation.h | 326 ++++++++ > include/linux/sched.h | 5 + > include/linux/tick.h | 3 + > include/uapi/linux/prctl.h | 6 + > init/Kconfig | 27 + > kernel/Makefile | 2 + > kernel/isolation.c | 714 ++++++++++++++++++ > kernel/signal.c | 2 + > kernel/sys.c | 6 + > kernel/time/hrtimer.c | 27 + > kernel/time/tick-sched.c | 18 + I asked you before to split this up into bits and pieces and argue and justify each change. Throwing this wholesale over the fence is going nowhere. It's not revieable at all. Aside of that ignoring review comments is a sure path to make yourself ignored: > +/* > + * Logging > + */ > +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...); > + > +#define pr_task_isol_emerg(cpu, fmt, ...) \ > + task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__) The comments various people made about that are not going away and none of this is going near anything I'm responsible for unless you provide these independent of the rest and with a reasonable justification why you can't use any other existing mechanism or extend it for your use case. Thanks, tglx ^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code 2020-11-23 17:42 ` Alex Belits @ 2020-11-23 17:57 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw) To: nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Kernel entry and exit functions for task isolation are added to context tracking and common entry points. Common handling of pending work on exit to userspace now processes isolation breaking, cleanup and start. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [abelits@marvell.com: adapted for kernel 5.10] Signed-off-by: Alex Belits <abelits@marvell.com> --- include/linux/hardirq.h | 2 ++ include/linux/sched.h | 2 ++ kernel/context_tracking.c | 5 +++++ kernel/entry/common.c | 10 +++++++++- kernel/irq/irqdesc.c | 5 +++++ 5 files changed, 23 insertions(+), 1 deletion(-) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 754f67ac4326..b9e604ae6a0d 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -7,6 +7,7 @@ #include <linux/lockdep.h> #include <linux/ftrace_irq.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <asm/hardirq.h> extern void synchronize_irq(unsigned int irq); @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); do { \ lockdep_off(); \ arch_nmi_enter(); \ + task_isolation_kernel_enter(); \ printk_nmi_enter(); \ BUG_ON(in_nmi() == NMI_MASK); \ __preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 5d8b17aa544b..51c2d774250b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -34,6 +34,7 @@ #include <linux/rseq.h> #include <linux/seqlock.h> #include <linux/kcsan.h> +#include <linux/isolation.h> /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; 
* [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code @ 2020-11-23 17:57 ` Alex Belits 0 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw) To: nitesh, frederic Cc: linux-arch, pauld, catalin.marinas, peterz, Prasun Kapoor, mtosatti, linux-kernel, rostedt, davem, leon, netdev, peterx, trix, tglx, linux-api, will, mingo, linux-arm-kernel Kernel entry and exit functions for task isolation are added to context tracking and common entry points. Common handling of pending work on exit to userspace now processes isolation breaking, cleanup and start. Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [abelits@marvell.com: adapted for kernel 5.10] Signed-off-by: Alex Belits <abelits@marvell.com> --- include/linux/hardirq.h | 2 ++ include/linux/sched.h | 2 ++ kernel/context_tracking.c | 5 +++++ kernel/entry/common.c | 10 +++++++++- kernel/irq/irqdesc.c | 5 +++++ 5 files changed, 23 insertions(+), 1 deletion(-) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 754f67ac4326..b9e604ae6a0d 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -7,6 +7,7 @@ #include <linux/lockdep.h> #include <linux/ftrace_irq.h> #include <linux/vtime.h> +#include <linux/isolation.h> #include <asm/hardirq.h> extern void synchronize_irq(unsigned int irq); @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); do { \ lockdep_off(); \ arch_nmi_enter(); \ + task_isolation_kernel_enter(); \ printk_nmi_enter(); \ BUG_ON(in_nmi() == NMI_MASK); \ __preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 5d8b17aa544b..51c2d774250b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -34,6 +34,7 @@ #include <linux/rseq.h> #include <linux/seqlock.h> #include <linux/kcsan.h> +#include <linux/isolation.h> /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @@ -1762,6 +1763,7 @@ extern char 
*__get_task_comm(char *to, size_t len, struct task_struct *tsk); #ifdef CONFIG_SMP static __always_inline void scheduler_ipi(void) { + task_isolation_kernel_enter(); /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting * TIF_NEED_RESCHED remotely (for the first time) will also send diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 36a98c48aedc..379a48fd0e65 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -21,6 +21,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state) __this_cpu_write(context_tracking.state, state); } context_tracking_recursion_exit(); + + task_isolation_exit_to_user_mode(); } EXPORT_SYMBOL_GPL(__context_tracking_enter); @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) if (!context_tracking_recursion_enter()) return; + task_isolation_kernel_enter(); + if (__this_cpu_read(context_tracking.state) == state) { if (__this_cpu_read(context_tracking.active)) { /* diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..10a520894105 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -4,6 +4,7 @@ #include <linux/entry-common.h> #include <linux/livepatch.h> #include <linux/audit.h> +#include <linux/isolation.h> #define CREATE_TRACE_POINTS #include <trace/events/syscalls.h> @@ -183,13 +184,20 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, static void exit_to_user_mode_prepare(struct pt_regs *regs) { - unsigned long ti_work = READ_ONCE(current_thread_info()->flags); + unsigned long ti_work; lockdep_assert_irqs_disabled(); + task_isolation_before_pending_work_check(); + + ti_work = READ_ONCE(current_thread_info()->flags); + if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) ti_work = 
exit_to_user_mode_loop(regs, ti_work); + if (unlikely(ti_work & _TIF_TASK_ISOLATION)) + task_isolation_start(); + arch_exit_to_user_mode_prepare(regs, ti_work); /* Ensure that the address limit is intact and no locks are held */ diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 1a7723604399..b8f0a7574f55 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -16,6 +16,7 @@ #include <linux/bitmap.h> #include <linux/irqdomain.h> #include <linux/sysfs.h> +#include <linux/isolation.h> #include "internals.h" @@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq, unsigned int irq = hwirq; int ret = 0; + task_isolation_kernel_enter(); + irq_enter(); #ifdef CONFIG_IRQ_DOMAIN @@ -710,6 +713,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq, unsigned int irq; int ret = 0; + task_isolation_kernel_enter(); + /* * NMI context needs to be setup earlier in order to deal with tracing. */ -- 2.20.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code 2020-11-23 17:57 ` Alex Belits @ 2020-11-23 22:31 ` Thomas Gleixner -1 siblings, 0 replies; 74+ messages in thread From: Thomas Gleixner @ 2020-11-23 22:31 UTC (permalink / raw) To: Alex Belits, nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Alex, On Mon, Nov 23 2020 at 17:57, Alex Belits wrote: > Kernel entry and exit functions for task isolation are added to context > tracking and common entry points. Common handling of pending work on exit > to userspace now processes isolation breaking, cleanup and start. Again: You fail to explain the rationale and just explain what the patch is doing. I can see what the patch is doing from the patch itself. > --- > include/linux/hardirq.h | 2 ++ > include/linux/sched.h | 2 ++ > kernel/context_tracking.c | 5 +++++ > kernel/entry/common.c | 10 +++++++++- > kernel/irq/irqdesc.c | 5 +++++ At least 3 different subsystems, which means this again failed to be split into separate patches. > extern void synchronize_irq(unsigned int irq); > @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); > do { \ > lockdep_off(); \ > arch_nmi_enter(); \ > + task_isolation_kernel_enter(); \ Where is the explanation why this is safe and correct vs. this fragile code path? > @@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk); > #ifdef CONFIG_SMP > static __always_inline void scheduler_ipi(void) > { > + task_isolation_kernel_enter(); Why is the scheduler_ipi() special? Just because everything else cannot happen at all? Oh well...
> #define CREATE_TRACE_POINTS > #include <trace/events/context_tracking.h> > @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state) > __this_cpu_write(context_tracking.state, state); > } > context_tracking_recursion_exit(); > + > + task_isolation_exit_to_user_mode(); Why is this here at all and why is it outside of the recursion protection > } > EXPORT_SYMBOL_GPL(__context_tracking_enter); > > @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) > if (!context_tracking_recursion_enter()) > return; > > + task_isolation_kernel_enter(); while this is inside? And why does the scheduler_ipi() on x86 call this twice? Just because? > if (__this_cpu_read(context_tracking.state) == state) { > if (__this_cpu_read(context_tracking.active)) { > /* > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > static void exit_to_user_mode_prepare(struct pt_regs *regs) > { > - unsigned long ti_work = READ_ONCE(current_thread_info()->flags); > + unsigned long ti_work; > > lockdep_assert_irqs_disabled(); > > + task_isolation_before_pending_work_check(); > + > + ti_work = READ_ONCE(current_thread_info()->flags); > + > if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) > ti_work = exit_to_user_mode_loop(regs, ti_work); > > + if (unlikely(ti_work & _TIF_TASK_ISOLATION)) > + task_isolation_start(); Where is the explanation of this change? Aside from that how does anything of this compile on x86 at all? Answer: It does not ... Stop this frenzy right now. It's going nowhere and all you achieve is to make people more grumpy than they are already. Thanks, tglx ^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH v5 5/9] task_isolation: Add driver-specific hooks 2020-11-23 17:42 ` Alex Belits @ 2020-11-23 17:57 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw) To: nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Some drivers don't call functions that call task_isolation_kernel_enter() in interrupt handlers. Call it directly. Signed-off-by: Alex Belits <abelits@marvell.com> --- drivers/irqchip/irq-armada-370-xp.c | 6 ++++++ drivers/irqchip/irq-gic-v3.c | 3 +++ drivers/irqchip/irq-gic.c | 3 +++ drivers/s390/cio/cio.c | 3 +++ 4 files changed, 15 insertions(+) diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c index d7eb2e93db8f..4ac7babe1abe 100644 --- a/drivers/irqchip/irq-armada-370-xp.c +++ b/drivers/irqchip/irq-armada-370-xp.c @@ -29,6 +29,7 @@ #include <linux/slab.h> #include <linux/syscore_ops.h> #include <linux/msi.h> +#include <linux/isolation.h> #include <asm/mach/arch.h> #include <asm/exception.h> #include <asm/smp_plat.h> @@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = { static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) { u32 msimask, msinr; + int isol_entered = 0; msimask = readl_relaxed(per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS) @@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) continue; if (is_chained) { + if (!isol_entered) { + task_isolation_kernel_enter(); + isol_entered = 1; + } irq = irq_find_mapping(armada_370_xp_msi_inner_domain, msinr - PCI_MSI_DOORBELL_START); generic_handle_irq(irq); diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 16fecc0febe8..ded26dd4da0f 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -18,6 
+18,7 @@ #include <linux/percpu.h> #include <linux/refcount.h> #include <linux/slab.h> +#include <linux/isolation.h> #include <linux/irqchip.h> #include <linux/irqchip/arm-gic-common.h> @@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs { u32 irqnr; + task_isolation_kernel_enter(); + irqnr = gic_read_iar(); if (gic_supports_nmi() && diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 6053245a4754..bb482b4ae218 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -35,6 +35,7 @@ #include <linux/interrupt.h> #include <linux/percpu.h> #include <linux/slab.h> +#include <linux/isolation.h> #include <linux/irqchip.h> #include <linux/irqchip/chained_irq.h> #include <linux/irqchip/arm-gic.h> @@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) struct gic_chip_data *gic = &gic_data[0]; void __iomem *cpu_base = gic_data_cpu_base(gic); + task_isolation_kernel_enter(); + do { irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK); irqnr = irqstat & GICC_IAR_INT_ID_MASK; diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c index 6d716db2a46a..beab88881b6d 100644 --- a/drivers/s390/cio/cio.c +++ b/drivers/s390/cio/cio.c @@ -20,6 +20,7 @@ #include <linux/kernel_stat.h> #include <linux/interrupt.h> #include <linux/irq.h> +#include <linux/isolation.h> #include <asm/cio.h> #include <asm/delay.h> #include <asm/irq.h> @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch) struct irb *irb; int irq_context; + task_isolation_kernel_enter(); + irb = this_cpu_ptr(&cio_irb); /* Store interrupt response block to lowcore. */ if (tsch(sch->schid, irb) != 0) -- 2.20.1 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks 2020-11-23 17:57 ` Alex Belits @ 2020-12-02 14:18 ` Mark Rutland -1 siblings, 0 replies; 74+ messages in thread From: Mark Rutland @ 2020-12-02 14:18 UTC (permalink / raw) To: Alex Belits Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote: > Some drivers don't call functions that call > task_isolation_kernel_enter() in interrupt handlers. Call it > directly. I don't think putting this in drivers is the right approach. IIUC we only need to track user<->kernel transitions, and we can do that within the architectural entry code before we ever reach irqchip code. I suspect the current approach is an artifact of that being difficult in the old structure of the arch code; recent rework should address that, and we can restructure things further in future. Thanks, Mark.
> Signed-off-by: Alex Belits <abelits@marvell.com> > --- > drivers/irqchip/irq-armada-370-xp.c | 6 ++++++ > drivers/irqchip/irq-gic-v3.c | 3 +++ > drivers/irqchip/irq-gic.c | 3 +++ > drivers/s390/cio/cio.c | 3 +++ > 4 files changed, 15 insertions(+) > > diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c > index d7eb2e93db8f..4ac7babe1abe 100644 > --- a/drivers/irqchip/irq-armada-370-xp.c > +++ b/drivers/irqchip/irq-armada-370-xp.c > @@ -29,6 +29,7 @@ > #include <linux/slab.h> > #include <linux/syscore_ops.h> > #include <linux/msi.h> > +#include <linux/isolation.h> > #include <asm/mach/arch.h> > #include <asm/exception.h> > #include <asm/smp_plat.h> > @@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = { > static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) > { > u32 msimask, msinr; > + int isol_entered = 0; > > msimask = readl_relaxed(per_cpu_int_base + > ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS) > @@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) > continue; > > if (is_chained) { > + if (!isol_entered) { > + task_isolation_kernel_enter(); > + isol_entered = 1; > + } > irq = irq_find_mapping(armada_370_xp_msi_inner_domain, > msinr - PCI_MSI_DOORBELL_START); > generic_handle_irq(irq); > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c > index 16fecc0febe8..ded26dd4da0f 100644 > --- a/drivers/irqchip/irq-gic-v3.c > +++ b/drivers/irqchip/irq-gic-v3.c > @@ -18,6 +18,7 @@ > #include <linux/percpu.h> > #include <linux/refcount.h> > #include <linux/slab.h> > +#include <linux/isolation.h> > > #include <linux/irqchip.h> > #include <linux/irqchip/arm-gic-common.h> > @@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs > { > u32 irqnr; > > + task_isolation_kernel_enter(); > + > irqnr = gic_read_iar(); > > if (gic_supports_nmi() && > diff --git 
a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c > index 6053245a4754..bb482b4ae218 100644 > --- a/drivers/irqchip/irq-gic.c > +++ b/drivers/irqchip/irq-gic.c > @@ -35,6 +35,7 @@ > #include <linux/interrupt.h> > #include <linux/percpu.h> > #include <linux/slab.h> > +#include <linux/isolation.h> > #include <linux/irqchip.h> > #include <linux/irqchip/chained_irq.h> > #include <linux/irqchip/arm-gic.h> > @@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) > struct gic_chip_data *gic = &gic_data[0]; > void __iomem *cpu_base = gic_data_cpu_base(gic); > > + task_isolation_kernel_enter(); > + > do { > irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK); > irqnr = irqstat & GICC_IAR_INT_ID_MASK; > diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c > index 6d716db2a46a..beab88881b6d 100644 > --- a/drivers/s390/cio/cio.c > +++ b/drivers/s390/cio/cio.c > @@ -20,6 +20,7 @@ > #include <linux/kernel_stat.h> > #include <linux/interrupt.h> > #include <linux/irq.h> > +#include <linux/isolation.h> > #include <asm/cio.h> > #include <asm/delay.h> > #include <asm/irq.h> > @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch) > struct irb *irb; > int irq_context; > > + task_isolation_kernel_enter(); > + > irb = this_cpu_ptr(&cio_irb); > /* Store interrupt response block to lowcore. */ > if (tsch(sch->schid, irb) != 0) > -- > 2.20.1 > ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks 2020-12-02 14:18 ` Mark Rutland @ 2020-12-04 0:43 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-12-04 0:43 UTC (permalink / raw) To: mark.rutland Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld, netdev On Wed, 2020-12-02 at 14:18 +0000, Mark Rutland wrote: > External Email > > ------------------------------------------------------------------- > --- > On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote: > > Some drivers don't call functions that call > > task_isolation_kernel_enter() in interrupt handlers. Call it > > directly. > > I don't think putting this in drivers is the right approach. IIUC we > only need to track user<->kernel transitions, and we can do that > within > the architectural entry code before we ever reach irqchip code. I > suspect the current approacch is an artifact of that being difficult > in > the old structure of the arch code; recent rework should address > that, > and we can restruecture things further in future. I agree completely. This patch only covers irqchip drivers with unusual entry procedures. -- Alex ^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality 2020-11-23 17:42 ` Alex Belits @ 2020-11-23 17:58 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw) To: nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev In do_notify_resume(), call task_isolation_before_pending_work_check() first, to report isolation breaking, then after handling all pending work, call task_isolation_start() for TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK, define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we don't clear _TIF_TASK_ISOLATION in the loop. Early kernel entry code calls task_isolation_kernel_enter(). In particular: Vectors: el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter() el1_irq -> asm_nmi_enter(), handle_arch_irq() el1_error -> do_serror() el0_sync -> el0_sync_handler() el0_irq -> handle_arch_irq() el0_error -> do_serror() el0_sync_compat -> el0_sync_compat_handler() el0_irq_compat -> handle_arch_irq() el0_error_compat -> do_serror() SDEI entry: __sdei_asm_handler -> __sdei_handler() -> nmi_enter() Functions called from there: asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() Handlers: do_serror() -> nmi_enter() -> task_isolation_kernel_enter() or task_isolation_kernel_enter() el1_sync_handler() -> task_isolation_kernel_enter() el0_sync_handler() -> task_isolation_kernel_enter() el0_sync_compat_handler() -> task_isolation_kernel_enter() handle_arch_irq() is irqchip-specific, most call handle_domain_irq() There is a separate patch for irqchips that do not follow this rule. 
handle_domain_irq() -> task_isolation_kernel_enter() do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) nmi_enter() -> task_isolation_kernel_enter() Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> [abelits@marvell.com: simplified to match kernel 5.10] Signed-off-by: Alex Belits <abelits@marvell.com> --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/barrier.h | 1 + arch/arm64/include/asm/thread_info.h | 7 +++++-- arch/arm64/kernel/entry-common.c | 7 +++++++ arch/arm64/kernel/ptrace.c | 10 ++++++++++ arch/arm64/kernel/signal.c | 13 ++++++++++++- arch/arm64/kernel/smp.c | 3 +++ 7 files changed, 39 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 1515f6f153a0..fc958d8d8945 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -141,6 +141,7 @@ config ARM64 select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_STACKLEAK + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h index c3009b0e5239..ad5a6dd380cf 100644 --- a/arch/arm64/include/asm/barrier.h +++ b/arch/arm64/include/asm/barrier.h @@ -49,6 +49,7 @@ #define dma_rmb() dmb(oshld) #define dma_wmb() dmb(oshst) +#define instr_sync() isb() /* * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz * and 0 otherwise. 
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 1fbab854a51b..3321c69c46fe 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -68,6 +68,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_UPROBE 4 /* uprobe breakpoint or singlestep */ #define TIF_FSCHECK 5 /* Check FS is USER_DS on return */ #define TIF_MTE_ASYNC_FAULT 6 /* MTE Asynchronous Tag Check Fault */ +#define TIF_TASK_ISOLATION 7 /* task isolation enabled for task */ #define TIF_SYSCALL_TRACE 8 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 9 /* syscall auditing */ #define TIF_SYSCALL_TRACEPOINT 10 /* syscall tracepoint for ftrace */ @@ -87,6 +88,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE) +#define _TIF_TASK_ISOLATION (1 << TIF_TASK_ISOLATION) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT) @@ -101,11 +103,12 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ - _TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT) + _TIF_UPROBE | _TIF_FSCHECK | \ + _TIF_MTE_ASYNC_FAULT | _TIF_TASK_ISOLATION) #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \ - _TIF_SYSCALL_EMU) + _TIF_SYSCALL_EMU | _TIF_TASK_ISOLATION) #ifdef CONFIG_SHADOW_CALL_STACK #define INIT_SCS \ diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c index 43d4c329775f..8152760de683 100644 --- a/arch/arm64/kernel/entry-common.c +++ b/arch/arm64/kernel/entry-common.c @@ -8,6 +8,7 @@ #include <linux/context_tracking.h> #include <linux/ptrace.h> 
#include <linux/thread_info.h> +#include <linux/isolation.h> #include <asm/cpufeature.h> #include <asm/daifflags.h> @@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs) { unsigned long esr = read_sysreg(esr_el1); + task_isolation_kernel_enter(); + switch (ESR_ELx_EC(esr)) { case ESR_ELx_EC_DABT_CUR: case ESR_ELx_EC_IABT_CUR: @@ -249,6 +252,8 @@ asmlinkage void notrace el0_sync_handler(struct pt_regs *regs) { unsigned long esr = read_sysreg(esr_el1); + task_isolation_kernel_enter(); + switch (ESR_ELx_EC(esr)) { case ESR_ELx_EC_SVC64: el0_svc(regs); @@ -321,6 +326,8 @@ asmlinkage void notrace el0_sync_compat_handler(struct pt_regs *regs) { unsigned long esr = read_sysreg(esr_el1); + task_isolation_kernel_enter(); + switch (ESR_ELx_EC(esr)) { case ESR_ELx_EC_SVC32: el0_svc_compat(regs); diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index f49b349e16a3..2941f2b16796 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -29,6 +29,7 @@ #include <linux/regset.h> #include <linux/tracehook.h> #include <linux/elf.h> +#include <linux/isolation.h> #include <asm/compat.h> #include <asm/cpufeature.h> @@ -1803,6 +1804,15 @@ int syscall_trace_enter(struct pt_regs *regs) return NO_SYSCALL; } + /* + * In task isolation mode, we may prevent the syscall from + * running, and if so we also deliver a signal to the process. + */ + if (test_thread_flag(TIF_TASK_ISOLATION)) { + if (task_isolation_syscall(regs->syscallno) == -1) + return NO_SYSCALL; + } + /* Do the secure computing after ptrace; failures should be fast. 
*/ if (secure_computing() == -1) return NO_SYSCALL; diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index a8184cad8890..e3a82b75e39d 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -20,6 +20,7 @@ #include <linux/tracehook.h> #include <linux/ratelimit.h> #include <linux/syscalls.h> +#include <linux/isolation.h> #include <asm/daifflags.h> #include <asm/debug-monitors.h> @@ -911,6 +912,11 @@ static void do_signal(struct pt_regs *regs) restore_saved_sigmask(); } +#define NOTIFY_RESUME_LOOP_FLAGS \ + (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \ + _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK | \ + _TIF_MTE_ASYNC_FAULT) + asmlinkage void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags) { @@ -921,6 +927,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, */ trace_hardirqs_off(); + task_isolation_before_pending_work_check(); + do { /* Check valid user FS if needed */ addr_limit_user_check(); @@ -956,7 +964,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, local_daif_mask(); thread_flags = READ_ONCE(current_thread_info()->flags); - } while (thread_flags & _TIF_WORK_MASK); + } while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS); + + if (thread_flags & _TIF_TASK_ISOLATION) + task_isolation_start(); } unsigned long __ro_after_init signal_minsigstksz; diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 18e9727d3f64..4401eac4710c 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -33,6 +33,7 @@ #include <linux/kernel_stat.h> #include <linux/kexec.h> #include <linux/kvm_host.h> +#include <linux/isolation.h> #include <asm/alternative.h> #include <asm/atomic.h> @@ -890,6 +891,8 @@ static void do_handle_IPI(int ipinr) { unsigned int cpu = smp_processor_id(); + task_isolation_kernel_enter(); + if ((unsigned)ipinr < NR_IPI) trace_ipi_entry_rcuidle(ipi_types[ipinr]); -- 2.20.1 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality 2020-11-23 17:58 ` Alex Belits @ 2020-12-02 13:59 ` Mark Rutland -1 siblings, 0 replies; 74+ messages in thread From: Mark Rutland @ 2020-12-02 13:59 UTC (permalink / raw) To: Alex Belits Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Hi Alex, On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote: > In do_notify_resume(), call task_isolation_before_pending_work_check() > first, to report isolation breaking, then after handling all pending > work, call task_isolation_start() for TIF_TASK_ISOLATION tasks. > > Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK, > define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we > don't clear _TIF_TASK_ISOLATION in the loop. > > Early kernel entry code calls task_isolation_kernel_enter(). In > particular: > > Vectors: > el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter() > el1_irq -> asm_nmi_enter(), handle_arch_irq() > el1_error -> do_serror() > el0_sync -> el0_sync_handler() > el0_irq -> handle_arch_irq() > el0_error -> do_serror() > el0_sync_compat -> el0_sync_compat_handler() > el0_irq_compat -> handle_arch_irq() > el0_error_compat -> do_serror() > > SDEI entry: > __sdei_asm_handler -> __sdei_handler() -> nmi_enter() As a heads-up, the arm64 entry code is changing, as we found that our lockdep, RCU, and context-tracking management wasn't quite right. I have a series of patches: https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com ... which are queued in the arm64 for-next/fixes branch. I intend to have some further rework ready for the next cycle. I'd appreciate if you could Cc me on any patches altering the arm64 entry code, as I have a vested interest. 
That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were chosen and context tracking was in use (e.g. with CONTEXT_TRACKING_FORCE), so I'm assuming that this series has not been tested in that configuration. What sort of testing has this seen? It would be very helpful for the next posting if you could provide any instructions on how to test this series (e.g. with pointers to any test suite that you have), since it's very easy to introduce subtle breakage in this area without realising it. > > Functions called from there: > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() > > Handlers: > do_serror() -> nmi_enter() -> task_isolation_kernel_enter() > or task_isolation_kernel_enter() > el1_sync_handler() -> task_isolation_kernel_enter() > el0_sync_handler() -> task_isolation_kernel_enter() > el0_sync_compat_handler() -> task_isolation_kernel_enter() > > handle_arch_irq() is irqchip-specific, most call handle_domain_irq() > There is a separate patch for irqchips that do not follow this rule. > > handle_domain_irq() -> task_isolation_kernel_enter() > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) > nmi_enter() -> task_isolation_kernel_enter() The IRQ cases look very odd to me. With the rework I've just done for arm64, we'll do the regular context tracking accounting before we ever get into handle_domain_irq() or similar, so I suspect that's not necessary at all? 
> > Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> > [abelits@marvell.com: simplified to match kernel 5.10] > Signed-off-by: Alex Belits <abelits@marvell.com> > --- > arch/arm64/Kconfig | 1 + > arch/arm64/include/asm/barrier.h | 1 + > arch/arm64/include/asm/thread_info.h | 7 +++++-- > arch/arm64/kernel/entry-common.c | 7 +++++++ > arch/arm64/kernel/ptrace.c | 10 ++++++++++ > arch/arm64/kernel/signal.c | 13 ++++++++++++- > arch/arm64/kernel/smp.c | 3 +++ > 7 files changed, 39 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 1515f6f153a0..fc958d8d8945 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -141,6 +141,7 @@ config ARM64 > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > select HAVE_ARCH_STACKLEAK > + select HAVE_ARCH_TASK_ISOLATION > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_TRACEHOOK > select HAVE_ARCH_TRANSPARENT_HUGEPAGE > diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h > index c3009b0e5239..ad5a6dd380cf 100644 > --- a/arch/arm64/include/asm/barrier.h > +++ b/arch/arm64/include/asm/barrier.h > @@ -49,6 +49,7 @@ > #define dma_rmb() dmb(oshld) > #define dma_wmb() dmb(oshst) > > +#define instr_sync() isb() I think I've asked on prior versions of the patchset, but what is this for? Where is it going to be used, and what is the expected semantics? I'm wary of exposing this outside of arch code because there aren't strong cross-architectural semantics, and at the least this requires some documentation. If it's unused, please delete it. [...] 
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c > index 43d4c329775f..8152760de683 100644 > --- a/arch/arm64/kernel/entry-common.c > +++ b/arch/arm64/kernel/entry-common.c > @@ -8,6 +8,7 @@ > #include <linux/context_tracking.h> > #include <linux/ptrace.h> > #include <linux/thread_info.h> > +#include <linux/isolation.h> > > #include <asm/cpufeature.h> > #include <asm/daifflags.h> > @@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs) > { > unsigned long esr = read_sysreg(esr_el1); > > + task_isolation_kernel_enter(); For regular context tracking we only account the user<->kernel transitions. This is a kernel->kernel transition, so surely this is not necessary? If nothing else, it doesn't feel well-balanced. I have not looked at the rest of this patch (or series) in detail. Thanks, Mark. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality 2020-12-02 13:59 ` Mark Rutland @ 2020-12-04 0:37 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-12-04 0:37 UTC (permalink / raw) To: mark.rutland Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld, netdev On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote: > External Email > > ------------------------------------------------------------------- > --- > Hi Alex, > > On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote: > > In do_notify_resume(), call > > task_isolation_before_pending_work_check() > > first, to report isolation breaking, then after handling all > > pending > > work, call task_isolation_start() for TIF_TASK_ISOLATION tasks. > > > > Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK, > > define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since > > we > > don't clear _TIF_TASK_ISOLATION in the loop. > > > > Early kernel entry code calls task_isolation_kernel_enter(). In > > particular: > > > > Vectors: > > el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter() > > el1_irq -> asm_nmi_enter(), handle_arch_irq() > > el1_error -> do_serror() > > el0_sync -> el0_sync_handler() > > el0_irq -> handle_arch_irq() > > el0_error -> do_serror() > > el0_sync_compat -> el0_sync_compat_handler() > > el0_irq_compat -> handle_arch_irq() > > el0_error_compat -> do_serror() > > > > SDEI entry: > > __sdei_asm_handler -> __sdei_handler() -> nmi_enter() > > As a heads-up, the arm64 entry code is changing, as we found that our > lockdep, RCU, and context-tracking management wasn't quite right. I > have > a series of patches: > > https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com > > ... which are queued in the arm64 for-next/fixes branch. 
I intend to > have some further rework ready for the next cycle. Thanks! > I'd appreciate if you > could Cc me on any patches altering the arm64 entry code, as I have a > vested interest. I will do that. > > That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were > chosen and context tracking was in use (e.g. with > CONTEXT_TRACKING_FORCE), I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and CONTEXT_TRACKING have to be enabled for it to do anything. I will check it with PROVE_LOCKING and your patches. The entry code only adds an inline function that, if task isolation is enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level operations, and accesses per-CPU variables by offset, so at the very least it should not add any problems. Even raw_local_irq_save() / raw_local_irq_restore() probably should be removed, however I wanted to have something that can be safely called if for whatever reason interrupts were enabled before the kernel was fully entered. > so I'm assuming that this series has not been > tested in that configuration. What sort of testing has this seen? > On various available arm64 hardware, with enabled CONFIG_TASK_ISOLATION CONFIG_NO_HZ_FULL CONFIG_HIGH_RES_TIMERS and disabled: CONFIG_HZ_PERIODIC CONFIG_NO_HZ_IDLE CONFIG_NO_HZ > It would be very helpful for the next posting if you could provide > any > instructions on how to test this series (e.g. with pointers to any > test > suite that you have), since it's very easy to introduce subtle > breakage > in this area without realising it. I will. Currently libtmc ( https://github.com/abelits/libtmc ) contains all userspace code used for testing, however I should document the testing procedures.
> > > Functions called from there: > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() > > > > Handlers: > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter() > > or task_isolation_kernel_enter() > > el1_sync_handler() -> task_isolation_kernel_enter() > > el0_sync_handler() -> task_isolation_kernel_enter() > > el0_sync_compat_handler() -> task_isolation_kernel_enter() > > > > handle_arch_irq() is irqchip-specific, most call > > handle_domain_irq() > > There is a separate patch for irqchips that do not follow this > > rule. > > > > handle_domain_irq() -> task_isolation_kernel_enter() > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) > > nmi_enter() -> task_isolation_kernel_enter() > > The IRQ cases look very odd to me. With the rework I've just done for > arm64, we'll do the regular context tracking accounting before we > ever > get into handle_domain_irq() or similar, so I suspect that's not > necessary at all? The goal is to call task_isolation_kernel_enter() before anything that depends on a CPU state, including pipeline, that could remain unsynchronized when the rest of the kernel was sending synchronization IPIs. Similarly task_isolation_kernel_return() should be called when it is safe to turn off synchronization. If the rework allows it to be done earlier, there is no need to touch more specific functions. > --- a/arch/arm64/include/asm/barrier.h > > +++ b/arch/arm64/include/asm/barrier.h > > @@ -49,6 +49,7 @@ > > #define dma_rmb() dmb(oshld) > > #define dma_wmb() dmb(oshst) > > > > +#define instr_sync() isb() > I think I've asked on prior versions of the patchset, but what is > this > for? Where is it going to be used, and what is the expected > semantics? > I'm wary of exposing this outside of arch code because there aren't > strong cross-architectural semantics, and at the least this requires > some documentation.
This is intended as an instruction pipeline flush for the situation when arch-independent code has to synchronize with potential changes that it missed. This is necessary after some other CPUs could modify code (and send IPIs to notify the rest but not the isolated CPU) while this one was still running an isolated task or, more likely, exiting from it, so it might be unlucky enough to pick up the old instructions before that point. It's only used on kernel entry. > > If it's unused, please delete it. > > [...] > > > diff --git a/arch/arm64/kernel/entry-common.c > > b/arch/arm64/kernel/entry-common.c > > index 43d4c329775f..8152760de683 100644 > > --- a/arch/arm64/kernel/entry-common.c > > +++ b/arch/arm64/kernel/entry-common.c > > @@ -8,6 +8,7 @@ > > #include <linux/context_tracking.h> > > #include <linux/ptrace.h> > > #include <linux/thread_info.h> > > +#include <linux/isolation.h> > > > > #include <asm/cpufeature.h> > > #include <asm/daifflags.h> > > @@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct > > pt_regs *regs) > > { > > unsigned long esr = read_sysreg(esr_el1); > > > > + task_isolation_kernel_enter(); > > For regular context tracking we only acount the user<->kernel > transitions. > > This is a kernel->kernel transition, so surely this is not necessary? Right. If we entered the kernel from an isolated task, we have already changed the flags. > > If nothing else, it doesn't feel well-balanced. > > I havwe not looked at the rest of this patch (or series) in detail. > > Thanks, > Mark. My goal was to make sure that all transitions between kernel and userspace are covered, so when in doubt I have added corresponding calls to those inline functions, and made them safe to be called from those places. With improved entry-exit code it should be easier to be sure where this can be done in a cleaner way. -- Alex ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality 2020-12-04 0:37 ` Alex Belits @ 2020-12-07 11:57 ` Mark Rutland -1 siblings, 0 replies; 74+ messages in thread From: Mark Rutland @ 2020-12-07 11:57 UTC (permalink / raw) To: Alex Belits Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld, netdev On Fri, Dec 04, 2020 at 12:37:32AM +0000, Alex Belits wrote: > On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote: > > On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote: > > As a heads-up, the arm64 entry code is changing, as we found that > > our lockdep, RCU, and context-tracking management wasn't quite > > right. I have a series of patches: > > > > https://lore.kernel.org/r/20201130115950.22492-1-mark.rutland@arm.com > > > > ... which are queued in the arm64 for-next/fixes branch. I intend to > > have some further rework ready for the next cycle. > > That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were > > chosen and context tracking was in use (e.g. with > > CONTEXT_TRACKING_FORCE), > > I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and > CONTEXT_TRACKING have to be enabled for it to do anything. > > I will check it with PROVE_LOCKING and your patches. Thanks. In future, please do test this functionality with PROVE_LOCKING, because otherwise bugs with RCU and IRQ state management will easily be missed (as has been the case until very recently). Testing with all those debug options enabled (and stating that you have done so) will give reviewers much greater confidence that this works, and if that does start spewing errors it saves everyone time identifying that.
> Entry code only adds an inline function that, if task isolation is > enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level > operations and accesses per-CPU variabled by offset, so at very least > it should not add any problems. Even raw_local_irq_save() / > raw_local_irq_restore() probably should be removed, however I wanted to > have something that can be safely called if by whatever reason > interrupts were enabled before kernel was fully entered. Sure. In the new flows we have new enter_from_*() and exit_to_*() functions where these calls should be able to live (and so we should be able to ensure a more consistent environment). The near-term plan for arm64 is to migrate more of the exception triage assembly to C, then to rework the arm64 entry code and generic entry code to be more similar, then to migrate as much as possible to the generic entry code. So please bear in mind that anything that adds to the differences between the two is going to be problematic. > > so I'm assuming that this series has not been > > tested in that configuration. What sort of testing has this seen? > > On various available arm64 hardware, with enabled > > CONFIG_TASK_ISOLATION > CONFIG_NO_HZ_FULL > CONFIG_HIGH_RES_TIMERS > > and disabled: > > CONFIG_HZ_PERIODIC > CONFIG_NO_HZ_IDLE > CONFIG_NO_HZ Ok. I'd recommend looking at the various debug options under the "kernel hacking" section in kconfig, and enabling some of those. At the very least PROVE_LOCKING, ideally also using the lockup detectors and anything else for debugging RCU, etc. [...]
> > > Functions called from there: > > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() > > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() > > > > > > Handlers: > > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter() > > > or task_isolation_kernel_enter() > > > el1_sync_handler() -> task_isolation_kernel_enter() > > > el0_sync_handler() -> task_isolation_kernel_enter() > > > el0_sync_compat_handler() -> task_isolation_kernel_enter() > > > > > > handle_arch_irq() is irqchip-specific, most call > > > handle_domain_irq() > > > There is a separate patch for irqchips that do not follow this > > > rule. > > > > > > handle_domain_irq() -> task_isolation_kernel_enter() > > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) > > > nmi_enter() -> task_isolation_kernel_enter() > > > > The IRQ cases look very odd to me. With the rework I've just done > > for arm64, we'll do the regular context tracking accounting before > > we ever get into handle_domain_irq() or similar, so I suspect that's > > not necessary at all? > > The goal is to call task_isolation_kernel_enter() before anything that > depends on a CPU state, including pipeline, that could remain un- > synchronized when the rest of the kernel was sending synchronization > IPIs. Similarly task_isolation_kernel_return() should be called when it > is safe to turn off synchronization. If rework allows it to be done > earlier, there is no need to touch more specific functions. Sure; I think that's sorted as a result of the changes I made recently. > > > --- a/arch/arm64/include/asm/barrier.h > > > +++ b/arch/arm64/include/asm/barrier.h > > > @@ -49,6 +49,7 @@ > > > #define dma_rmb() dmb(oshld) > > > #define dma_wmb() dmb(oshst) > > > > > > +#define instr_sync() isb() > > > > I think I've asked on prior versions of the patchset, but what is > > this for? Where is it going to be used, and what is the expected > > semantics? 
I'm wary of exposing this outside of arch code because > > there aren't strong cross-architectural semantics, and at the least > > this requires some documentation. > > This is intended as an instruction pipeline flush for the situation > when arch-independent code has to synchronize with potential changes > that it missed. This is necessary after some other CPUs could modify > code (and send IPIs to notify the rest but not isolated CPU) while this > one was still running isolated task or, more likely, exiting from it, > so it might be unlucky enough to pick the old instructions before that > point. > > It's only used on kernel entry. Sure. My point is that instr_sync() is a very generic sounding name that doesn't get any of that across, and it's entirely undocumented. I think something like arch_simulate_kick_cpu() would be better to get the intended semantic across, and we should add thorough documentation somewhere as to what this is meant to do. Thanks, Mark. ^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() 2020-11-23 17:42 ` Alex Belits @ 2020-11-23 17:58 ` Alex Belits -1 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw) To: nitesh, frederic Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev From: Yuri Norov <ynorov@marvell.com> For nohz_full CPUs the desirable behavior is to receive interrupts generated by tick_nohz_full_kick_cpu(). But for hard isolation it's obviously not desirable because it breaks isolation. This patch adds a check for it. Signed-off-by: Yuri Norov <ynorov@marvell.com> [abelits@marvell.com: updated, only exclude CPUs running isolated tasks] Signed-off-by: Alex Belits <abelits@marvell.com> --- kernel/time/tick-sched.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a213952541db..6c8679e200f0 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include <linux/sched/clock.h> #include <linux/sched/stat.h> #include <linux/sched/nohz.h> +#include <linux/isolation.h> #include <linux/module.h> #include <linux/irq_work.h> #include <linux/posix-timers.h> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) */ void tick_nohz_full_kick_cpu(int cpu) { - if (!tick_nohz_full_cpu(cpu)) + smp_rmb(); + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) return; irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu); -- 2.20.1 ^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() 2020-11-23 17:58 ` Alex Belits @ 2020-11-23 22:13 ` Frederic Weisbecker -1 siblings, 0 replies; 74+ messages in thread From: Frederic Weisbecker @ 2020-11-23 22:13 UTC (permalink / raw) To: Alex Belits Cc: nitesh, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev Hi Alex, On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote: > From: Yuri Norov <ynorov@marvell.com> > > For nohz_full CPUs the desirable behavior is to receive interrupts > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > obviously not desirable because it breaks isolation. > > This patch adds check for it. > > Signed-off-by: Yuri Norov <ynorov@marvell.com> > [abelits@marvell.com: updated, only exclude CPUs running isolated tasks] > Signed-off-by: Alex Belits <abelits@marvell.com> > --- > kernel/time/tick-sched.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index a213952541db..6c8679e200f0 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -20,6 +20,7 @@ > #include <linux/sched/clock.h> > #include <linux/sched/stat.h> > #include <linux/sched/nohz.h> > +#include <linux/isolation.h> > #include <linux/module.h> > #include <linux/irq_work.h> > #include <linux/posix-timers.h> > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) > */ > void tick_nohz_full_kick_cpu(int cpu) > { > - if (!tick_nohz_full_cpu(cpu)) > + smp_rmb(); > + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) > return; Like I said in subsequent reviews, we are not going to ignore IPIs. We must fix the sources of these IPIs instead. Thanks. > > irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu); > -- > 2.20.1 > ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  From: Alex Belits @ 2020-11-23 22:35
  To: frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev

On Mon, 2020-11-23 at 23:13 +0100, Frederic Weisbecker wrote:
> Hi Alex,
>
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
[...]
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
> > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> >  		return;
>
> Like I said in subsequent reviews, we are not going to ignore IPIs.
> We must fix the sources of these IPIs instead.

This is what I am working on right now. This is made with the assumption
that a CPU running an isolated task has no reason to be kicked, because
nothing else is supposed to be running there. Usually this is true, and
when it is not, it is still safe as long as everything else behaves
correctly. For this version I have kept the original implementation with
minimal changes, to make it possible to use task isolation at all.

I agree that it's a much better idea to determine whether the CPU should
be kicked. If it really should, that will be a legitimate cause to break
isolation there, because a CPU running an isolated task has no
legitimate reason to have timers running. Right now I am trying to
determine the origin of timers that _still_ show up as running in the
current kernel version, so I think this is a rather large chunk of work
that I have to do separately.

-- 
Alex
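The change under discussion reduces to combining two per-CPU predicates before queuing the kick. A minimal userspace sketch of that gating, with hypothetical stand-ins for `tick_nohz_full_cpu()`, `task_isolation_on_cpu()`, and `irq_work_queue_on()` (none of these stubs reflect real kernel internals):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-ins for the per-CPU predicates the kernel checks. */
static atomic_bool nohz_full_cpu[NR_CPUS];
static atomic_bool isolated_on_cpu[NR_CPUS];

static int kicks_queued;	/* counts stand-in irq_work_queue_on() calls */

/* Mirrors the gating in the patched tick_nohz_full_kick_cpu():
 * only kick a CPU that is nohz_full AND not running an isolated task. */
static void kick_cpu(int cpu)
{
	/* Analogue of the patch's smp_rmb(): order this CPU's reads
	 * against the remote CPU's isolation-flag update. */
	atomic_thread_fence(memory_order_acquire);
	if (!atomic_load(&nohz_full_cpu[cpu]) ||
	    atomic_load(&isolated_on_cpu[cpu]))
		return;
	kicks_queued++;		/* irq_work_queue_on() stand-in */
}
```

As the reviewers note, this silently drops the kick for isolated CPUs rather than addressing why the kick was requested; the sketch only illustrates the predicate ordering, not a position on that design question.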
* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  From: Thomas Gleixner @ 2020-11-23 22:36
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23 2020 at 17:58, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
>
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
>
> This patch adds check for it.

git grep 'This patch' Documentation/process/

>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

Undocumented smp_rmb() ...

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;

I still have to see a convincing argument why task isolation is special
and not just a straight forward extension of NOHZ full cpu isolation.
It's not special as much as you want it to be special.

Thanks,

        tglx
* Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  From: Mark Rutland @ 2020-12-02 14:20
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
[...]
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> -	if (!tick_nohz_full_cpu(cpu))
> +	smp_rmb();

What does this barrier pair with? The commit message doesn't mention it,
and it's not clear in-context.

Thanks,
Mark.

> +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>  		return;
>
>  	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> --
> 2.20.1
* Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  From: Alex Belits @ 2020-12-04 0:54
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld, netdev

On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
[...]
> >  void tick_nohz_full_kick_cpu(int cpu)
> >  {
> > -	if (!tick_nohz_full_cpu(cpu))
> > +	smp_rmb();
>
> What does this barrier pair with? The commit message doesn't mention
> it, and it's not clear in-context.

With barriers in task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode().

-- 
Alex
* Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
  From: Mark Rutland @ 2020-12-07 11:58
  To: Alex Belits
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, catalin.marinas, peterz, frederic, leon, linux-arm-kernel, pauld, netdev

On Fri, Dec 04, 2020 at 12:54:29AM +0000, Alex Belits wrote:
> On Wed, 2020-12-02 at 14:20 +0000, Mark Rutland wrote:
> > On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote:
[...]
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > -	if (!tick_nohz_full_cpu(cpu))
> > > +	smp_rmb();
> >
> > What does this barrier pair with? The commit message doesn't mention
> > it, and it's not clear in-context.
>
> With barriers in task_isolation_kernel_enter()
> and task_isolation_exit_to_user_mode().

Please add a comment in the code as to what it pairs with.

Thanks,
Mark.
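The pairing Mark asks to have documented is the standard message-passing pattern: the isolation entry/exit path publishes its state and then flips the per-CPU isolation flag behind a write barrier, and a remote reader needs a read barrier between loading the flag and loading the published state. A self-contained C11 analogue of that pattern (the names and the payload variable are hypothetical; the real kernel paths are `task_isolation_kernel_enter()` / `task_isolation_exit_to_user_mode()`, which this thread does not show):

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool isolated_flag;   /* per-CPU isolation state */
static atomic_int   payload;         /* state published before the flag flips */

/* Writer side (isolation entry analogue): publish payload, then the flag.
 * The release fence plays the role of smp_wmb(). */
static void writer_publish(int value)
{
	atomic_store_explicit(&payload, value, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&isolated_flag, true, memory_order_relaxed);
}

/* Reader side (remote-kick analogue): load the flag, then the payload.
 * The acquire fence plays the role of smp_rmb() and pairs with the
 * release fence above: if the flag is seen set, the payload is visible. */
static int reader_check(bool *flag_seen)
{
	*flag_seen = atomic_load_explicit(&isolated_flag, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&payload, memory_order_relaxed);
}
```

Note that for the fences to pair, the reader's barrier must sit between its two loads, which is exactly the kind of placement comment the review is asking for.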
* [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
  From: Alex Belits @ 2020-11-23 17:58
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

From: Yuri Norov <ynorov@marvell.com>

CPUs running isolated tasks are in userspace, so they don't have to
perform ring buffer updates immediately. If ring_buffer_resize()
schedules the update on those CPUs, isolation is broken. To prevent
that, updates for CPUs running isolated tasks are performed locally,
like for offline CPUs. A race condition between this update and
isolation breaking is avoided at the cost of disabling per_cpu buffer
writing for the time of update when it coincides with isolation
breaking.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated to prevent race with isolation breaking]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/trace/ring_buffer.c | 63 ++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dc83b3fa9fe7..9e4fb3ed2af0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -21,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/init.h>
+#include <linux/isolation.h>
 #include <linux/hash.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
@@ -1939,6 +1940,38 @@ static void update_pages_handler(struct work_struct *work)
 	complete(&cpu_buffer->update_done);
 }
 
+static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer,
+			       int cpu)
+{
+	bool rv = false;
+
+	smp_rmb();
+	if (task_isolation_on_cpu(cpu)) {
+		/*
+		 * CPU is running isolated task. Since it may lose
+		 * isolation and re-enter kernel simultaneously with
+		 * this update, disable recording until it's done.
+		 */
+		atomic_inc(&cpu_buffer->record_disabled);
+		/* Make sure, update is done, and isolation state is current */
+		smp_mb();
+		if (task_isolation_on_cpu(cpu)) {
+			/*
+			 * If CPU is still running isolated task, we
+			 * can be sure that breaking isolation will
+			 * happen while recording is disabled, and CPU
+			 * will not touch this buffer until the update
+			 * is done.
+			 */
+			rb_update_pages(cpu_buffer);
+			cpu_buffer->nr_pages_to_update = 0;
+			rv = true;
+		}
+		atomic_dec(&cpu_buffer->record_disabled);
+	}
+	return rv;
+}
+
 /**
  * ring_buffer_resize - resize the ring buffer
  * @buffer: the buffer to resize.
@@ -2028,13 +2061,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 			if (!cpu_buffer->nr_pages_to_update)
 				continue;
 
-			/* Can't run something on an offline CPU. */
+			/*
+			 * Can't run something on an offline CPU.
+			 *
+			 * CPUs running isolated tasks don't have to
+			 * update ring buffers until they exit
+			 * isolation because they are in
+			 * userspace. Use the procedure that prevents
+			 * race condition with isolation breaking.
+			 */
 			if (!cpu_online(cpu)) {
 				rb_update_pages(cpu_buffer);
 				cpu_buffer->nr_pages_to_update = 0;
 			} else {
-				schedule_work_on(cpu,
-						 &cpu_buffer->update_pages_work);
+				if (!update_if_isolated(cpu_buffer, cpu))
+					schedule_work_on(cpu,
+							 &cpu_buffer->update_pages_work);
 			}
 		}
 
@@ -2083,13 +2125,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
 
 		get_online_cpus();
 
-		/* Can't run something on an offline CPU. */
+		/*
+		 * Can't run something on an offline CPU.
+		 *
+		 * CPUs running isolated tasks don't have to update
+		 * ring buffers until they exit isolation because they
+		 * are in userspace. Use the procedure that prevents
+		 * race condition with isolation breaking.
+		 */
 		if (!cpu_online(cpu_id))
 			rb_update_pages(cpu_buffer);
 		else {
-			schedule_work_on(cpu_id,
+			if (!update_if_isolated(cpu_buffer, cpu_id))
+				schedule_work_on(cpu_id,
 						&cpu_buffer->update_pages_work);
-			wait_for_completion(&cpu_buffer->update_done);
+			wait_for_completion(&cpu_buffer->update_done);
+		}
 		}
 
 		cpu_buffer->nr_pages_to_update = 0;
-- 
2.20.1
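The race avoidance in update_if_isolated() is a check / disable / re-check pattern: test the isolation flag, disable recording, order the disable against a second test with a full barrier, and only if the flag is still set perform the local update. A userspace analogue of just that pattern, with a toy struct standing in for ring_buffer_per_cpu and a simple counter standing in for rb_update_pages() (all names and fields here are illustrative, not the kernel's):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct buf {
	atomic_int record_disabled;   /* writers check this before writing */
	int        nr_pages_to_update;
	int        pages;             /* stands in for the page list */
};

/* task_isolation_on_cpu() stand-in for one remote CPU. */
static _Atomic bool isolated;

/* Mirrors update_if_isolated(): resize locally only if the remote CPU
 * is, and stays, isolated while recording is disabled. */
static bool update_if_isolated(struct buf *b)
{
	bool done = false;

	atomic_thread_fence(memory_order_acquire);      /* smp_rmb() analogue */
	if (atomic_load(&isolated)) {
		atomic_fetch_add(&b->record_disabled, 1);
		atomic_thread_fence(memory_order_seq_cst); /* smp_mb() analogue */
		if (atomic_load(&isolated)) {
			/* Still isolated: any isolation break will see
			 * record_disabled and stay off the buffer. */
			b->pages += b->nr_pages_to_update;  /* rb_update_pages() */
			b->nr_pages_to_update = 0;
			done = true;
		}
		atomic_fetch_sub(&b->record_disabled, 1);
	}
	return done;
}
```

The return value matches the kernel usage: when it is false the caller falls back to `schedule_work_on()` so the target CPU performs the update itself.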
* [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  From: Alex Belits @ 2020-11-23 17:58
  To: nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

From: Yuri Norov <ynorov@marvell.com>

Make sure that kick_all_cpus_sync() does not call CPUs that are running
isolated tasks.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: use safe task_isolation_cpumask() implementation]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/smp.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 4d17501433be..b2faecf58ed0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -932,9 +932,21 @@ static void do_nothing(void *unused)
  */
 void kick_all_cpus_sync(void)
 {
+	struct cpumask mask;
+
 	/* Make sure the change is visible before we kick the cpus */
 	smp_mb();
-	smp_call_function(do_nothing, NULL, 1);
+
+	preempt_disable();
+#ifdef CONFIG_TASK_ISOLATION
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+#else
+	cpumask_setall(&mask);
+#endif
+	smp_call_function_many(&mask, do_nothing, NULL, 1);
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
-- 
2.20.1
* Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Frederic Weisbecker @ 2020-11-23 22:29 UTC (permalink / raw)
To: Alex Belits
Cc: nitesh, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas,
    rostedt, linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz,
    leon, linux-arm-kernel, pauld, netdev

On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> From: Yuri Norov <ynorov@marvell.com>
>
> Make sure that kick_all_cpus_sync() does not call CPUs that are running
> isolated tasks.
>
> Signed-off-by: Yuri Norov <ynorov@marvell.com>
> [abelits@marvell.com: use safe task_isolation_cpumask() implementation]
> Signed-off-by: Alex Belits <abelits@marvell.com>
> ---
>  kernel/smp.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 4d17501433be..b2faecf58ed0 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
>   */
>  void kick_all_cpus_sync(void)
>  {
> +	struct cpumask mask;
> +
>  	/* Make sure the change is visible before we kick the cpus */
>  	smp_mb();
> -	smp_call_function(do_nothing, NULL, 1);
> +
> +	preempt_disable();
> +#ifdef CONFIG_TASK_ISOLATION
> +	cpumask_clear(&mask);
> +	task_isolation_cpumask(&mask);
> +	cpumask_complement(&mask, &mask);
> +#else
> +	cpumask_setall(&mask);
> +#endif
> +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> +	preempt_enable();

Same comment about IPIs here.

>  }
>  EXPORT_SYMBOL_GPL(kick_all_cpus_sync);
>
> --
> 2.20.1
* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Alex Belits @ 2020-11-23 22:39 UTC (permalink / raw)
To: frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt,
    peterx, tglx, nitesh, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, catalin.marinas, pauld, netdev

On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > running isolated tasks.
[...]
> > +	preempt_disable();
> > +#ifdef CONFIG_TASK_ISOLATION
> > +	cpumask_clear(&mask);
> > +	task_isolation_cpumask(&mask);
> > +	cpumask_complement(&mask, &mask);
> > +#else
> > +	cpumask_setall(&mask);
> > +#endif
> > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > +	preempt_enable();
>
> Same comment about IPIs here.

This is different from timers. The original design was based on the
idea that every CPU should be able to enter kernel at any time and run
kernel code with no additional preparation. Then the only solution is
to always do full broadcast and require all CPUs to process it.

What I am trying to introduce is the idea of CPU that is not likely to
run kernel code any soon, and can afford to go through an additional
synchronization procedure on the next entry into kernel. The
synchronization is not skipped, it simply happens later, early in
kernel entry code.

--
Alex
* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Frederic Weisbecker @ 2020-11-23 23:21 UTC (permalink / raw)
To: Alex Belits
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt,
    peterx, tglx, nitesh, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, catalin.marinas, pauld, netdev

On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> This is different from timers. The original design was based on the
> idea that every CPU should be able to enter kernel at any time and run
> kernel code with no additional preparation. Then the only solution is
> to always do full broadcast and require all CPUs to process it.
>
> What I am trying to introduce is the idea of CPU that is not likely to
> run kernel code any soon, and can afford to go through an additional
> synchronization procedure on the next entry into kernel. The
> synchronization is not skipped, it simply happens later, early in
> kernel entry code.

Ah I see, this is ordered that way:

    ll_isol_flags = ISOLATED

             CPU 0                                 CPU 1
        ------------------                    -----------------
                                              // kernel entry
        data_to_sync = 1                      ll_isol_flags = ISOLATED_BROKEN
        smp_mb()                              smp_mb()
        if ll_isol_flags(CPU 1) == ISOLATED   READ data_to_sync
            smp_call(CPU 1)

You should document that, ie: explain why what you're doing is safe.

Also beware that the data to sync in question doesn't need to be
visible in the entry code before task_isolation_kernel_enter(). You
need to audit all the callers of kick_all_cpus_sync().
* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Alex Belits @ 2020-11-25 3:20 UTC (permalink / raw)
To: frederic
Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel, rostedt,
    peterx, tglx, nitesh, linux-arch, mtosatti, will, peterz, leon,
    linux-arm-kernel, catalin.marinas, pauld, netdev

On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> Ah I see, this is ordered that way:
>
>     ll_isol_flags = ISOLATED
>
>              CPU 0                                 CPU 1
>         ------------------                    -----------------
>                                               // kernel entry
>         data_to_sync = 1                      ll_isol_flags = ISOLATED_BROKEN
>         smp_mb()                              smp_mb()
>         if ll_isol_flags(CPU 1) == ISOLATED   READ data_to_sync
>             smp_call(CPU 1)

The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following
can happen:

1. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens before the check for
isolation on the sender side:

    ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                   -----------------
                                         // kernel entry
                                         if (ll_isol_flags == ISOLATED) {
                                             ll_isol_flags = ISOLATED_BROKEN
    data_to_sync = 1                         smp_mb()  // data_to_sync undetermined
    smp_mb()                             }
    // ll_isol_flags(CPU 1) updated
    if ll_isol_flags(CPU 1) != ISOLATED  // interrupts enabled
        smp_call(CPU 1)                  // kernel entry again
                                         if (ll_isol_flags == ISOLATED)
                                         // nothing happens
                                         // explicit or implied barriers
                                         // data_to_sync updated
                                         // kernel exit
    // CPU 0 assumes, CPU 1 will see     READ data_to_sync
    // data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens after the check for
isolation on the sender side:

    ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                   -----------------
    data_to_sync = 1                     // kernel entry
    smp_mb()                             // data_to_sync undetermined
                                         // should not access data_to_sync here
                                         if (ll_isol_flags == ISOLATED) {
                                             ll_isol_flags = ISOLATED_BROKEN
    // ll_isol_flags(CPU 1) undetermined     smp_mb()
                                             // data_to_sync updated
    if ll_isol_flags(CPU 1) != ISOLATED  }
        // possibly nothing happens
    // CPU 0 assumes, CPU 1 will see     READ data_to_sync
    // data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on
the sender side:

    ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                   -----------------
                                         // kernel entry
                                         if (ll_isol_flags == ISOLATED) {
                                             ll_isol_flags = ISOLATED_BROKEN
                                             smp_mb()
                                         }  // interrupts are enabled at some
    data_to_sync = 1                     // point here, data_to_sync value
    smp_mb()                             // is undetermined, CPU 0 makes no
    // ll_isol_flags(CPU 1) updated      // assumptions about it
    if ll_isol_flags(CPU 1) != ISOLATED
        smp_call(CPU 1)                  // kernel entry again
                                         if (ll_isol_flags == ISOLATED)
                                         // nothing happens
                                         // explicit or implied barriers
                                         // data_to_sync updated
                                         // kernel exit
    // CPU 0 assumes, CPU 1 will see
    // data_to_sync = 1 when in kernel   READ data_to_sync

4. Kernel entry processing happens after the check for isolation on
the sender side:

    ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                CPU 1
    ------------------                   -----------------
    data_to_sync = 1                     // userspace or early kernel entry
    smp_mb()
    if ll_isol_flags(CPU 1) != ISOLATED
        smp_call(CPU 1)  // skipped      // userspace or early kernel entry
    // CPU 0 assumes, CPU 1 will see     // continues undisturbed
    // data_to_sync = 1 when in kernel
                                         // kernel entry
                                         // data_to_sync undetermined
                                         // should not access data_to_sync here
                                         if (ll_isol_flags == ISOLATED) {
                                             ll_isol_flags = ISOLATED_BROKEN
                                             smp_mb()  // data_to_sync updated
                                         }
                                         READ data_to_sync

This also applies to exit to userspace after enabling isolation -- once
ll_isol_flags is set to ISOLATED, synchronization will be missed, so one
final barrier should happen before returning to userspace and enabling
interrupts in the process. In this case "unlucky" timing would result in
an smp call interrupting userspace that is already supposed to be
isolated; it will trigger the normal isolation breaking procedure but
otherwise will be an unremarkable synchronization call.
On the other hand, synchronization that was supposed to happen after
setting ll_isol_flags will be delayed to the next kernel entry, so there
should be nothing that needs synchronization between the end of
task_isolation_exit_to_user_mode() and entering userspace (interrupts
are ok, though, they will trigger isolation breaking).

This is why I have placed task_isolation_kernel_enter() and
task_isolation_exit_to_user_mode() in entry/exit code, and when I have
seen any ambiguity or had any doubt, allowed duplicate calls to
task_isolation_kernel_enter() on entry. If I overdid any of that, I
would appreciate fixes and clarifications, however it should be taken
into account that for now different architectures and even drivers may
call common and specific functions in a slightly different order.

Synchronization also applies to possible effects on the pipeline (when
the originating CPU writes instructions). There is a version under
development that delays TLB flushes in this manner. On arm64 that
requires a switch to soft TLB flushes, and that's another potential
ball of wax. On architectures that already use soft TLB flushes, this
will be unavoidable because a TLB flush becomes another IPI, and IPIs
break isolation. Maybe it will be better to make somewhat smarter TLB
handling in general, so it will be possible to avoid bothering
unrelated CPUs. But then, I guess, hardware may still overdo it. Then
it will be closer to the situation with timers. For now, I want to
first do the obvious and exclude isolated tasks from TLB updates until
they are back in the kernel, and do a slow full flush on isolation
breaking.

> You should document that, ie: explain why what you're doing is safe.

I tried to do that in comments, however this is clearly insufficient
despite their verbosity. Would it make sense to create a separate
documentation text file?
Current documentation seems to be light on detailed specifications for
things under heavy development, except for something very significant
that affects nearly everyone. Would it make sense to include all
scenarios like the above plus exit to userspace, and the existing
sequences of calls that should lead to synchronization calls,
task_isolation_kernel_enter() or task_isolation_exit_to_user_mode()?

> Also beware that the data to sync in question doesn't need to be
> visible in the entry code before task_isolation_kernel_enter().

Right. And after task_isolation_exit_to_user_mode() as well. This is
why I had to dig through entry/exit code. I can produce a somewhat
usable call sequence diagram. If by any chance I am wrong there
somewhere, I welcome all relevant comments and corrections. At this
point I am not sure only about things that are called when
CONFIG_TRACE_IRQFLAGS is enabled, simply because I have not yet checked
what they depend on, and virtualization-specific entry/exit is not
entirely clear to me. Maybe for correctness' sake I should have
declared task isolation incompatible with those until I either know for
sure or have added working support for them.

> You need to audit all the callers of kick_all_cpus_sync().

Right now it's just three places. do_tune_cpucache() does the right
thing: it keeps CPUs from seeing old values of cachep->cpu_cache as
they are being deallocated. powerpc kvm_arch_destroy_vm() cares only
about CPUs with VCPUs on them, which for now should not be used with
isolation. arm64 flush_icache_range() is the main reason why this could
be potentially problematic: it makes sure that other CPUs will not run
stale instructions, and it's called from all places where code is
modified. If in another situation some kind of delayed processing could
be possible, this one has to be done in the right sequence. And that's
the reason for instr_sync() after smp_mb() in
task_isolation_kernel_enter().
Since flush_icache_range() could skip this CPU, we have to synchronize
it by ourselves.

There is an important issue of not having early kernel entry / late
kernel exit rely on something that may be stale (with both data and
code). With the above mentioned exceptions for now, it can be
demonstrated that this is correct. It should be taken into account that
even though static keys are used in early kernel entry code, setting
static keys does not synchronize flushes, and therefore already should
be done before use.

Task isolation in virtualization guests can be a perfectly valid thing
to support (it just has to be propagated to the host if permitted),
however this is something I will want to revisit in the future. For
now, I assume that virtualization-related events are not supposed to
break isolation, however it would be nice to be ready for handling that
possibility.

--
Alex
* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus @ 2020-11-25 3:20 ` Alex Belits 0 siblings, 0 replies; 74+ messages in thread From: Alex Belits @ 2020-11-25 3:20 UTC (permalink / raw) To: frederic Cc: linux-arch, nitesh, pauld, catalin.marinas, peterz, Prasun Kapoor, mtosatti, linux-kernel, rostedt, davem, leon, netdev, peterx, trix, tglx, linux-api, will, mingo, linux-arm-kernel On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote: > On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote: > > > > This is different from timers. The original design was based on the > > idea that every CPU should be able to enter kernel at any time and > > run > > kernel code with no additional preparation. Then the only solution > > is > > to always do full broadcast and require all CPUs to process it. > > > > What I am trying to introduce is the idea of CPU that is not likely > > to > > run kernel code any soon, and can afford to go through an > > additional > > synchronization procedure on the next entry into kernel. The > > synchronization is not skipped, it simply happens later, early in > > kernel entry code. > > Ah I see, this is ordered that way: > > ll_isol_flags = ISOLATED > > CPU 0 CPU 1 > ------------------ ----------------- > // kernel entry > data_to_sync = 1 ll_isol_flags = ISOLATED_BROKEN > smp_mb() smp_mb() > if ll_isol_flags(CPU 1) == ISOLATED READ data_to_sync > smp_call(CPU 1) > The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more complex. In terms of scenarios, on entry from isolation the following can happen: 1. 
Kernel entry happens simultaneously with operation that requires synchronization, kernel entry processing happens before the check for isolation on the sender side: ll_isol_flags(CPU 1) = ISOLATED CPU 0 CPU 1 ------------------ ----------------- // kernel entry if (ll_isol_flags == ISOLATED) { ll_isol_flags = ISOLATED_BROKEN data_to_sync = 1 smp_mb() // data_to_sync undetermined smp_mb() } // ll_isol_flags(CPU 1) updated if ll_isol_flags(CPU 1) != ISOLATED // interrupts enabled smp_call(CPU 1) // kernel entry again if (ll_isol_flags == ISOLATED) // nothing happens // explicit or implied barriers // data_to_sync updated // kernel exit // CPU 0 assumes, CPU 1 will see READ data_to_sync // data_to_sync = 1 when in kernel 2. Kernel entry happens simultaneously with operation that requires synchronization, kernel entry processing happens after the check for isolation on the sender side: ll_isol_flags(CPU 1) = ISOLATED CPU 0 CPU 1 ------------------ ----------------- data_to_sync = 1 // kernel entry smp_mb() // data_to_sync undetermined // should not access data_to_sync here if (ll_isol_flags == ISOLATED) { ll_isol_flags = ISOLATED_BROKEN // ll_isol_flags(CPU 1) undetermined smp_mb() // data_to_sync updated if ll_isol_flags(CPU 1) != ISOLATED } // possibly nothing happens // CPU 0 assumes, CPU 1 will see READ data_to_sync // data_to_sync = 1 when in kernel 3. 
Kernel entry processing completed before the check for isolation on the sender side: ll_isol_flags(CPU 1) = ISOLATED CPU 0 CPU 1 ------------------ ----------------- // kernel entry if (ll_isol_flags == ISOLATED) { ll_isol_flags = ISOLATED_BROKEN smp_mb() } // interrupts are enabled at some data_to_sync = 1 // point here, data_to_sync value smp_mb() // is undetermined, CPU 0 makes no // ll_isol_flags(CPU 1) updated // assumptions about it if ll_isol_flags(CPU 1) != ISOLATED // smp_call(CPU 1) // kernel entry again if (ll_isol_flags == ISOLATED) // nothing happens // explicit or implied barriers // data_to_sync updated // kernel exit // CPU 0 assumes, CPU 1 will see // data_to_sync = 1 when in kernel READ data_to_sync 4. Kernel entry processing happens after the check for isolation on the sender side: ll_isol_flags(CPU 1) = ISOLATED CPU 0 CPU 1 ------------------ ----------------- data_to_sync = 1 // userspace or early kernel entry smp_mb() if ll_isol_flags(CPU 1) != ISOLATED smp_call(CPU 1) // skipped // userspace or early kernel entry // CPU 0 assumes, CPU 1 will see // continues undisturbed // data_to_sync = 1 when in kernel // kernel entry // data_to_sync undetermined // should not access data_to_sync here if (ll_isol_flags == ISOLATED) { ll_isol_flags = ISOLATED_BROKEN smp_mb() // data_to_sync updated } READ data_to_sync This also applies to exit to userspace after enabling isolation -- once ll_isol_flags is set to ISOLATED, synchronization will be missed, so one final barrier should happen before returning to userspace and enabling interrupts in the process. In this case "unlucky" timing would result in smp call interrupting userspace that is already supposed to be isolated, it will trigger normal isolation breaking procedure but otherwise will be an unremarkable synchronization call. 
On the other hand, synchronization that was supposed to happen after setting ll_isol_flags, will be delayed to the next kernel entry, so there should be nothing that needs synchronization between the end of task_isolation_exit_to_user_mode() and entering userspace (interrupts are ok, though, they will trigger isolation breaking). This is why I have placed task_isolation_kernel_enter() and task_isolation_exit_to_user_mode() in entry/exit code, and when I have seen any ambiguity or had any doubt allowed duplicate calls to task_isolation_kernel_enter() on entry. If I overdid any of that, I would appreciate fixes and clarifications, however it should be taken into account that for now different architectures and even drivers may call common and specific functions in a slightly different order. Synchronization also applies to possible effects on pipeline (when originating CPU writes instructions). There is a version under development that delays TLB flushes in this manner. On arm64 that requires a switch to soft TLB flushes, and that's another potential ball of wax. On architectures that already use soft TLB flushes, this will be unavoidable because TLB flush becomes another IPI, and IPIs break isolation. Maybe it will be better to make a somewhat smarter TLB handling in general, so it will be possible to avoid bothering unrelated CPUs. But then, I guess, hardware may still overdo it.Then it will be closer to the situation with timers. For now, I want to first do the obvious and exclude isolated tasks from TLB updates until they are back in kernel, and do a slow full flush on isolation breaking. > You should document that, ie: explain why what you're doing is safe. I tried to do that in comments, however this is clearly insufficient despite their verbosity. Would it make sense to create a separate documentation text file? 
Current documentation seems to be light on detailed specifications for
things under heavy development, except for something very significant
that affects nearly everyone. Would it make sense to include all
scenarios like the above, plus exit to userspace, and the existing
sequences of calls that should lead to synchronization calls,
task_isolation_kernel_enter() or task_isolation_exit_to_user_mode()?

> Also Beware though that the data to sync in question doesn't need to
> be visible
> in the entry code before task_isolation_kernel_enter().

Right. And after task_isolation_exit_to_user_mode() as well. This is
why I had to dig through entry/exit code. I can produce a somewhat
usable call sequence diagram. If by any chance I am wrong there
somewhere, I welcome all relevant comments and corrections. At this
point I am not sure only about things that are called when
CONFIG_TRACE_IRQFLAGS is enabled, simply because I have not yet checked
what they depend on, and virtualization-specific entry/exit is not
entirely clear to me. Maybe for correctness' sake I should have
declared task isolation incompatible with those until I either know for
sure or have added working support for them.

> You need to audit all
> the callers of kick_all_cpus_sync().

Right now it's just three places.

do_tune_cpucache() does the right thing: it keeps CPUs from seeing old
values of cachep->cpu_cache as they are being deallocated.

powerpc kvm_arch_destroy_vm() cares only about CPUs with VCPUs on them,
which for now should not be used with isolation.

arm64 flush_icache_range() is the main reason why this could be
potentially problematic: it makes sure that other CPUs will not run
stale instructions, and it's called from all places where code is
modified. If in other situations some kind of delayed processing could
be possible, this one has to be done in the right sequence. And that's
the reason for instr_sync() after smp_mb() in
task_isolation_kernel_enter().
Since flush_icache_range() could skip this CPU, we have to synchronize
it ourselves.

There is an important issue of not having early kernel entry / late
kernel exit rely on something that may be stale (both data and code).
With the above-mentioned exceptions, for now it can be demonstrated
that this is correct. It should be taken into account that even though
static keys are used in early kernel entry code, setting static keys
does not synchronize flushes, and therefore it already has to be done
before use.

Task isolation in virtualization guests can be a perfectly valid thing
to support (it just has to be propagated to the host if permitted),
however this is something I will want to revisit in the future. For
now, I assume that virtualization-related events are not supposed to
break isolation, however it would be nice to be ready to handle that
possibility.

-- 
Alex

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
  2020-11-23 23:21 ` Frederic Weisbecker
@ 2021-01-22 15:00 ` Marcelo Tosatti
  -1 siblings, 0 replies; 74+ messages in thread
From: Marcelo Tosatti @ 2021-01-22 15:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alex Belits, Prasun Kapoor, linux-api, davem, trix, mingo,
	linux-kernel, rostedt, peterx, tglx, nitesh, linux-arch, will,
	peterz, leon, linux-arm-kernel, catalin.marinas, pauld, netdev

On Tue, Nov 24, 2020 at 12:21:06AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> >
> > On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> > > External Email
> > >
> > > ----------------------------------------------------------------------
> > > On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote:
> > > > From: Yuri Norov <ynorov@marvell.com>
> > > >
> > > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > > running isolated tasks.
> > > >
> > > > Signed-off-by: Yuri Norov <ynorov@marvell.com>
> > > > [abelits@marvell.com: use safe task_isolation_cpumask()
> > > > implementation]
> > > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > > ---
> > > >  kernel/smp.c | 14 +++++++++++++-
> > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > > index 4d17501433be..b2faecf58ed0 100644
> > > > --- a/kernel/smp.c
> > > > +++ b/kernel/smp.c
> > > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> > > >   */
> > > >  void kick_all_cpus_sync(void)
> > > >  {
> > > > +	struct cpumask mask;
> > > > +
> > > >  	/* Make sure the change is visible before we kick the cpus */
> > > >  	smp_mb();
> > > > -	smp_call_function(do_nothing, NULL, 1);
> > > > +
> > > > +	preempt_disable();
> > > > +#ifdef CONFIG_TASK_ISOLATION
> > > > +	cpumask_clear(&mask);
> > > > +	task_isolation_cpumask(&mask);
> > > > +	cpumask_complement(&mask, &mask);
> > > > +#else
> > > > +	cpumask_setall(&mask);
> > > > +#endif
> > > > +	smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > > +	preempt_enable();
> > >
> > > Same comment about IPIs here.
> >
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and run
> > kernel code with no additional preparation. Then the only solution is
> > to always do full broadcast and require all CPUs to process it.
> >
> > What I am trying to introduce is the idea of CPU that is not likely to
> > run kernel code any soon, and can afford to go through an additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.

Perhaps a bitmask of pending flushes makes more sense?
static_key_enable IPIs is one of the users, but for its case it would
be necessary to differentiate between in-kernel mode and out-of-kernel
mode atomically (since an i-cache flush must be performed if the
isolated CPU is in kernel mode).

> Ah I see, this is ordered that way:
>
> ll_isol_flags = ISOLATED
>
>          CPU 0                           CPU 1
>    ------------------              -----------------
>                                    // kernel entry
>    data_to_sync = 1                ll_isol_flags = ISOLATED_BROKEN
>    smp_mb()                        smp_mb()
>    if ll_isol_flags(CPU 1) == ISOLATED
>        smp_call(CPU 1)             READ data_to_sync

Since isolated mode with syscalls is a desired feature, having a
separate atomic with in_kernel_mode = 0/1 (set/cleared on kernel entry
/ kernel exit while TIF_TASK_ISOLATION is set) would be necessary, with
similar race-free logic as above.

> You should document that, ie: explain why what you're doing is safe.
>
> Also Beware though that the data to sync in question doesn't need to be visible
> in the entry code before task_isolation_kernel_enter(). You need to audit all
> the callers of kick_all_cpus_sync().

Cscope tag: flush_icache_range
   #   line  filename / context / line
   1     96  arch/arc/kernel/jump_label.c <<arch_jump_label_transform>>
             flush_icache_range(entry->code, entry->code + JUMP_LABEL_NOP_SIZE);

This case would be OK for delayed processing before kernel entry, as
long as no code before task_isolation_kernel_enter() can be modified
(which I am not sure about). But:

  36     28  arch/ia64/include/asm/cacheflush.h <<flush_icache_user_page>>
             flush_icache_range(_addr, _addr + (len)); \

is less certain.

Alex, do you recall if arch_jump_label_transform was the only offender,
or were there others as well? (I suppose handling only the ones which
matter in production at the moment, and later fixing individual ones,
makes the most sense.)

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-23 17:42 ` Alex Belits
@ 2020-11-24 16:36 ` Tom Rix
  -1 siblings, 0 replies; 74+ messages in thread
From: Tom Rix @ 2020-11-24 16:36 UTC (permalink / raw)
  To: Alex Belits, nitesh, frederic
  Cc: Prasun Kapoor, linux-api, davem, mingo, catalin.marinas, rostedt,
	linux-kernel, peterx, tglx, linux-arch, mtosatti, will, peterz,
	leon, linux-arm-kernel, pauld, netdev

On 11/23/20 9:42 AM, Alex Belits wrote:
> This is an update of task isolation work that was originally done by
> Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> November 2017. It is adapted to the current kernel and cleaned up to
> implement its functionality in a more complete and cleaner manner.

I am having problems applying the patchset to today's linux-next.

Which kernel should I be using?

Thanks,
Tom

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-24 16:36 ` Tom Rix
@ 2020-11-24 17:40 ` Alex Belits
  -1 siblings, 0 replies; 74+ messages in thread
From: Alex Belits @ 2020-11-24 17:40 UTC (permalink / raw)
  To: trix
  Cc: Prasun Kapoor, linux-api, davem, mingo, catalin.marinas, rostedt,
	linux-kernel, peterx, tglx, nitesh, linux-arch, mtosatti, will,
	peterz, frederic, leon, linux-arm-kernel, pauld, netdev

On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> External Email
>
> ----------------------------------------------------------------------
>
> On 11/23/20 9:42 AM, Alex Belits wrote:
> > This is an update of task isolation work that was originally done by
> > Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > November 2017. It is adapted to the current kernel and cleaned up to
> > implement its functionality in a more complete and cleaner manner.
>
> I am having problems applying the patchset to today's linux-next.
>
> Which kernel should I be using ?

The patches are against Linus' tree, in particular, commit
a349e4c659609fd20e4beea89e5c4a4038e33a95

-- 
Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-24 17:40 ` Alex Belits
@ 2020-12-02 14:02 ` Mark Rutland
  -1 siblings, 0 replies; 74+ messages in thread
From: Mark Rutland @ 2020-12-02 14:02 UTC (permalink / raw)
  To: Alex Belits
  Cc: trix, Prasun Kapoor, linux-api, davem, mingo, catalin.marinas,
	rostedt, linux-kernel, peterx, tglx, nitesh, linux-arch, mtosatti,
	will, peterz, frederic, leon, linux-arm-kernel, pauld, netdev

On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
>
> On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> > External Email
> >
> > ----------------------------------------------------------------------
> >
> > On 11/23/20 9:42 AM, Alex Belits wrote:
> > > This is an update of task isolation work that was originally done by
> > > Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
> > > November 2017. It is adapted to the current kernel and cleaned up to
> > > implement its functionality in a more complete and cleaner manner.
> >
> > I am having problems applying the patchset to today's linux-next.
> >
> > Which kernel should I be using ?
>
> The patches are against Linus' tree, in particular, commit
> a349e4c659609fd20e4beea89e5c4a4038e33a95

Is there any reason to base on that commit in particular?

Generally it's preferred that a series is based on a tag (so either a
release or an -rc kernel), and that the cover letter explains what the
base is. If you can do that in future it'll make the series much easier
to work with.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 74+ messages in thread
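The workflow Mark describes, sketched on a toy repository -- the
repository, the v1.0 tag, and the branch name are all made up for the
example; for this series the base tag would be something like an -rc
release:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

git init -q repo && cd repo
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "base commit"
git tag v1.0                            # the release tag to base on

git checkout -q -b task-isolation v1.0  # series branch starts at the tag
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "patch 1"

# Generate the series against the tag; reviewers can then apply it on
# top of a well-known base instead of an arbitrary commit.
git format-patch v1.0
```

Mentioning the tag in the cover letter (or using format-patch's
`--base` option to record it in the patches) gives reviewers the same
information mechanically.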
* Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-12-02 14:02 ` Mark Rutland
@ 2020-12-04  0:39 ` Alex Belits
  -1 siblings, 0 replies; 74+ messages in thread
From: Alex Belits @ 2020-12-04 0:39 UTC (permalink / raw)
  To: mark.rutland
  Cc: Prasun Kapoor, linux-api, davem, trix, mingo, linux-kernel,
	rostedt, peterx, tglx, nitesh, linux-arch, mtosatti, will, peterz,
	frederic, leon, linux-arm-kernel, catalin.marinas, pauld, netdev

On Wed, 2020-12-02 at 14:02 +0000, Mark Rutland wrote:
> On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote:
> > > I am having problems applying the patchset to today's linux-next.
> > >
> > > Which kernel should I be using ?
> >
> > The patches are against Linus' tree, in particular, commit
> > a349e4c659609fd20e4beea89e5c4a4038e33a95
>
> Is there any reason to base on that commit in particular?

No specific reason for that particular commit.

> Generally it's preferred that a series is based on a tag (so either a
> release or an -rc kernel), and that the cover letter explains what the
> base is. If you can do that in future it'll make the series much easier
> to work with.

Ok.

-- 
Alex

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-11-23 17:42 ` Alex Belits
@ 2020-12-05 20:40 ` Pavel Machek
  -1 siblings, 0 replies; 74+ messages in thread
From: Pavel Machek @ 2020-12-05 20:40 UTC (permalink / raw)
  To: Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, tglx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

Hi!

> General description
>
> This is the result of development and maintenance of task isolation
> functionality that originally started based on task isolation patch
> v15 and was later updated to include v16. It provided predictable
> environment for userspace tasks running on arm64 processors alongside
> with full-featured Linux environment. It is intended to provide
> reliable interruption-free environment from the point when a userspace
> task enters isolation and until the moment it leaves isolation or
> receives a signal intentionally sent to it, and was successfully used
> for this purpose. While CPU isolation with nohz provides an
> environment that is close to this requirement, the remaining IPIs and
> other disturbances keep it from being usable for tasks that require
> complete predictability of CPU timing.

So... what kind of guarantees does this aim to provide / what tasks it
is useful for?

For real time response, we have other approaches.

If you want to guarantee performance of the "isolated" task... I don't
see how that works. Other tasks on the system still compete for DRAM
bandwidth, caches, etc...

So... what is the usecase?

								Pavel
-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [PATCH v5 0/9] "Task_isolation" mode
  2020-12-05 20:40 ` Pavel Machek
@ 2020-12-05 23:25 ` Thomas Gleixner
  -1 siblings, 0 replies; 74+ messages in thread
From: Thomas Gleixner @ 2020-12-05 23:25 UTC (permalink / raw)
  To: Pavel Machek, Alex Belits
  Cc: nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo,
	catalin.marinas, rostedt, linux-kernel, peterx, linux-arch,
	mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

Pavel,

On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> So... what kind of guarantees does this aim to provide / what tasks it
> is useful for?
>
> For real time response, we have other approaches.

Depends on your requirements. Some problems are actually better solved
with busy polling. See below.

> If you want to guarantee performance of the "isolated" task... I don't
> see how that works. Other tasks on the system still compete for DRAM
> bandwidth, caches, etc...

Applications which want to run as undisturbed as possible. There is
quite a range of those:

 - Hardware in the loop simulation is today often done with that crude
   approach of "offlining" a CPU and then instead of playing dead
   jumping to a preloaded bare metal executable. That's a horrible hack
   and impossible to debug, but gives them the results they need to
   achieve. These applications are well optimized vs. cache and memory
   foot print, so they don't worry about these things too much and they
   surely don't run on SMI and BIOS value add inflicted machines.

   Don't even think about waiting for an interrupt to achieve what
   these folks are doing. So no, there are problems which a general
   purpose realtime OS cannot solve ever.

 - HPC computations on large data sets. While the memory foot print is
   large the access patterns are cache optimized.

   The problem there is that any unnecessary IPI, tick interrupt or
   whatever nuisance is disturbing the carefully optimized cache usage
   and alone getting rid of the timer interrupt gained them measurable
   performance. Even very low single digit percentage of runtime saving
   is valuable for these folks because the compute time on such beasts
   is expensive.

 - Realtime guests in KVM. With posted interrupts and a fully populated
   host side page table there is no point in running host side
   interrupts or IPIs for random accounting or whatever purposes as
   they affect the latency in the guest. With all the side effects
   mitigated and a properly set up guest and host it is possible to get
   to a zero exit situation after the bootup phase which means pretty
   much matching bare metal behaviour.

   Yes, you can do that with e.g. Jailhouse as well, but you lose lots
   of the fancy things KVM provides. And people care about these not
   just because they are fancy. They care because their application
   scenario needs them.

There are more reasons why people want to be able to get as much
isolation from the OS as possible but at the same time have a sane
execution environment, debugging, performance monitoring and the OS
provided protection mechanisms instead of horrible hacks.

Isolation makes sense for a range of applications and there is no
reason why Linux should not support them.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [PATCH v5 0/9] "Task_isolation" mode
2020-12-05 23:25 ` Thomas Gleixner
@ 2020-12-11 18:08 ` Yury Norov
-1 siblings, 0 replies; 74+ messages in thread
From: Yury Norov @ 2020-12-11 18:08 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Pavel Machek, Alex Belits, nitesh, frederic, Prasun Kapoor, linux-api, davem, trix, mingo, catalin.marinas, rostedt, linux-kernel, peterx, linux-arch, mtosatti, will, peterz, leon, linux-arm-kernel, pauld, netdev

On Sun, Dec 06, 2020 at 12:25:45AM +0100, Thomas Gleixner wrote:
> Pavel,
>
> On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote:
> > So... what kind of guarantees does this aim to provide / what tasks it
> > is useful for?
> >
> > For real time response, we have other approaches.
>
> Depends on your requirements. Some problems are actually better solved
> with busy polling. See below.
>
> > If you want to guarantee performance of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...
>
> Applications which want to run as undisturbed as possible. There is
> quite a range of those:
>
> - Hardware in the loop simulation is today often done with that crude
>   approach of "offlining" a CPU and then instead of playing dead
>   jumping to a preloaded bare metal executable. That's a horrible hack
>   and impossible to debug, but gives them the results they need to
>   achieve. These applications are well optimized vs. cache and memory
>   footprint, so they don't worry about these things too much and they
>   surely don't run on SMI and BIOS value add inflicted machines.
>
>   Don't even think about waiting for an interrupt to achieve what
>   these folks are doing. So no, there are problems which a general
>   purpose realtime OS cannot solve ever.
>
> - HPC computations on large data sets. While the memory footprint is
>   large the access patterns are cache optimized.
>
>   The problem there is that any unnecessary IPI, tick interrupt or
>   whatever nuisance is disturbing the carefully optimized cache usage
>   and alone getting rid of the timer interrupt gained them measurable
>   performance. Even very low single digit percentage of runtime saving
>   is valuable for these folks because the compute time on such beasts
>   is expensive.
>
> - Realtime guests in KVM. With posted interrupts and a fully populated
>   host side page table there is no point in running host side
>   interrupts or IPIs for random accounting or whatever purposes as
>   they affect the latency in the guest. With all the side effects
>   mitigated and a properly set up guest and host it is possible to get
>   to a zero exit situation after the bootup phase which means pretty
>   much matching bare metal behaviour.
>
>   Yes, you can do that with e.g. Jailhouse as well, but you lose lots
>   of the fancy things KVM provides. And people care about these not
>   just because they are fancy. They care because their application
>   scenario needs them.
>
> There are more reasons why people want to be able to get as much
> isolation from the OS as possible but at the same time have a sane
> execution environment, debugging, performance monitoring and the OS
> provided protection mechanisms instead of horrible hacks.
>
> Isolation makes sense for a range of applications and there is no reason
> why Linux should not support them.

One good client for task isolation is Open Data Plane. It even contains
code stubs intended to enable isolation where needed.

> > If you want to guarantee performance of the "isolated" task... I don't
> > see how that works. Other tasks on the system still compete for DRAM
> > bandwidth, caches, etc...

My experiments say that the typical delay caused by a dry IPI or
syscall is 2000-20000 'ticks', while the typical delay caused by a
cache miss is 3-30 ticks. To guarantee cache / memory bandwidth, one
can use resctrl. Linux implements resctrl for x86 only, but arm64 has
CPU-side support for it.

Thanks,
Yury

> Thanks,
>
> tglx
Thread overview: 74+ messages

2020-11-23 17:42 [PATCH v5 0/9] "Task_isolation" mode Alex Belits
2020-11-23 17:56 ` [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function Alex Belits
2020-11-23 21:48 ` Thomas Gleixner
2020-11-23 17:56 ` [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function Alex Belits
2020-11-23 21:49 ` Thomas Gleixner
2020-11-23 17:56 ` [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel Alex Belits
2020-11-23 22:01 ` Thomas Gleixner
2020-11-23 17:57 ` [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code Alex Belits
2020-11-23 22:31 ` Thomas Gleixner
2020-11-23 17:57 ` [PATCH v5 5/9] task_isolation: Add driver-specific hooks Alex Belits
2020-12-02 14:18 ` Mark Rutland
2020-12-04  0:43 ` [EXT] " Alex Belits
2020-11-23 17:58 ` [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality Alex Belits
2020-12-02 13:59 ` Mark Rutland
2020-12-04  0:37 ` [EXT] " Alex Belits
2020-12-07 11:57 ` Mark Rutland
2020-11-23 17:58 ` [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu() Alex Belits
2020-11-23 22:13 ` Frederic Weisbecker
2020-11-23 22:35 ` Alex Belits
2020-11-23 22:36 ` Thomas Gleixner
2020-12-02 14:20 ` Mark Rutland
2020-12-04  0:54 ` [EXT] " Alex Belits
2020-12-07 11:58 ` Mark Rutland
2020-11-23 17:58 ` [PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize Alex Belits
2020-11-23 17:58 ` [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus Alex Belits
2020-11-23 22:29 ` Frederic Weisbecker
2020-11-23 22:39 ` [EXT] " Alex Belits
2020-11-23 23:21 ` Frederic Weisbecker
2020-11-25  3:20 ` Alex Belits
2021-01-22 15:00 ` Marcelo Tosatti
2020-11-24 16:36 ` [PATCH v5 0/9] "Task_isolation" mode Tom Rix
2020-11-24 17:40 ` [EXT] " Alex Belits
2020-12-02 14:02 ` Mark Rutland
2020-12-04  0:39 ` Alex Belits
2020-12-05 20:40 ` Pavel Machek
2020-12-05 23:25 ` Thomas Gleixner
2020-12-11 18:08 ` Yury Norov