* [PATCH v15 02/13] vmstat: add vmstat_idle function
       [not found] <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This function reports whether no vmstat worker is currently running
on this core and the vmstat diffs do not require an update.  It is
called from the task-isolation code to determine whether any work is
actually needed to quiesce vmstat.
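
For reference, here is a minimal sketch of the intended call pattern,
mirroring the usage that appears later in this series in
kernel/isolation.c (patch 04/13); it is illustrative only and not
part of this patch:

    #include <linux/vmstat.h>

    /* Sketch: quiesce vmstat before returning to userspace. */
    static void quiesce_vmstat_sketch(void)
    {
            /* Only pay for the synchronous flush if work is pending. */
            if (!vmstat_idle())
                    quiet_vmstat_sync();
    }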

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fab62aa74079..69b6cc4be909 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -235,6 +235,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -338,6 +339,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57fc29750da6..7dd17c06d3a7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1763,6 +1763,16 @@ void quiet_vmstat_sync(void)
 }
 
 /*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
+/*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
  * threads for vm statistics updates disabled because of
-- 
2.7.2


* [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed
       [not found] <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
  2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	linux-mm, linux-kernel
  Cc: Chris Metcalf

This per-cpu check was previously done inline in the loop in
lru_add_drain_all().  Factoring it out so that it can be called for
a particular cpu is helpful for the task-isolation patches.
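
As a hedged illustration, the task-isolation code in patch 04/13
consumes the factored-out helper roughly as below (a sketch, not part
of this patch; the real callers are task_isolation_ready() and
task_isolation_enter(), and smp_processor_id() assumes the caller has
preemption disabled, as the return-to-userspace path does):

    #include <linux/swap.h>
    #include <linux/smp.h>

    /* Sketch: drain this cpu's pagevecs only when needed. */
    static void quiesce_lru_sketch(void)
    {
            /* Draining now avoids a later IPI-driven drain. */
            if (lru_add_drain_needed(smp_processor_id()))
                    lru_add_drain();
    }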

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..58966a235298 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
+extern bool lru_add_drain_needed(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 75c63bb2a1da..a2be6f0931b5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -655,6 +655,15 @@ void deactivate_page(struct page *page)
 	}
 }
 
+bool lru_add_drain_needed(int cpu)
+{
+	return (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		need_activate_page_drain(cpu));
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -699,11 +708,7 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
-		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		if (lru_add_drain_needed(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
 			cpumask_set_cpu(cpu, &has_work);
-- 
2.7.2


* [PATCH v15 04/13] task_isolation: add initial support
       [not found] <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
  2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
  2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
@ 2016-08-16 21:19 ` Chris Metcalf
  2016-08-29 16:33   ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-08-16 21:19 UTC (permalink / raw)
  To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Andy Lutomirski,
	Michal Hocko, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device-driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.
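
A minimal userspace sketch of entering and leaving the mode follows
(illustrative only; it assumes just the prctl constants added by this
patch, repeated under #ifndef for older headers, plus the
EINVAL/EAGAIN semantics described in kernel/isolation.c below):

    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_TASK_ISOLATION
    # define PR_SET_TASK_ISOLATION      48
    # define PR_GET_TASK_ISOLATION      49
    # define PR_TASK_ISOLATION_ENABLE   (1 << 0)
    #endif

    int main(void)
    {
            /* Must already be affinitized to one task_isolation cpu. */
            if (prctl(PR_SET_TASK_ISOLATION,
                      PR_TASK_ISOLATION_ENABLE) < 0) {
                    /* EINVAL: bad affinity; EAGAIN: tick can't stop. */
                    fprintf(stderr, "prctl: %s\n", strerror(errno));
                    return 1;
            }

            /* ... isolated polling loop here; no syscalls allowed ... */

            /* prctl() is exempt from the syscall check, so this works. */
            prctl(PR_SET_TASK_ISOLATION, 0);
            return 0;
    }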

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as quiescing a worker thread that would otherwise run soon
(e.g. the vmstat worker), or cleaning up state now that would
otherwise require a future IPI to the core (e.g. the mm lru
per-cpu cache).  In addition, it requests rescheduling if the
scheduler dyntick is still running.
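
Putting the two hooks together, an architecture's exit path uses them
roughly as in the simplified sketch below.  The names here are
hypothetical (e.g. _TIF_WORK_MASK_SKETCH); the real glue appears in
the separate x86/tile/arm64 patches:

    /* Sketch of the return-to-userspace loop described above. */
    static void exit_to_usermode_loop_sketch(struct thread_info *ti)
    {
            while (ti->flags & _TIF_WORK_MASK_SKETCH) {
                    local_irq_enable();

                    /* ... resched, signal delivery, other TIF work ... */

                    if (test_ti_thread_flag(ti, TIF_TASK_ISOLATION))
                            task_isolation_enter();

                    local_irq_disable();

                    /* Clear the flag only once preconditions hold. */
                    if (test_ti_thread_flag(ti, TIF_TASK_ISOLATION) &&
                        task_isolation_ready())
                            clear_ti_thread_flag(ti, TIF_TASK_ISOLATION);
            }
            /* Interrupts stay disabled from here until userspace. */
    }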

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall-checking
test ignores the prctl() syscall so that the bit can be cleared
again later, and ignores exit/exit_group so that a task can exit
without a pointless signal killing it as it tries to do so.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c                  |  18 +++
 include/linux/isolation.h           |  60 ++++++++++
 include/linux/sched.h               |   3 +
 include/linux/tick.h                |   2 +
 include/uapi/linux/prctl.h          |   5 +
 init/Kconfig                        |  27 +++++
 kernel/Makefile                     |   1 +
 kernel/fork.c                       |   3 +
 kernel/isolation.c                  | 218 ++++++++++++++++++++++++++++++++++++
 kernel/signal.c                     |   8 ++
 kernel/sys.c                        |   9 ++
 kernel/time/tick-sched.c            |  36 +++---
 13 files changed, 385 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a49186..7f1336b50dcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation=	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION=y,
+			specify the list of CPUs on which tasks will be
+			able to use prctl(PR_SET_TASK_ISOLATION) to set up
+			task isolation mode.  Setting this boot flag
+			implicitly also sets up nohz_full and isolcpus mode
+			for the listed set of cpus.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include <linux/of.h>
 #include <linux/cpufeature.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	int n = 0, len = PAGE_SIZE-2;
+
+	n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(task_isolation_map));
+
+	return n;
+}
+static DEVICE_ATTR(task_isolation, 0444, print_cpus_task_isolation, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -460,6 +475,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_task_isolation.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..d9288b85b41f
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,60 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <linux/tick.h>
+#include <linux/prctl.h>
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/* cpus that are configured to support task isolation */
+extern cpumask_var_t task_isolation_map;
+
+extern int task_isolation_init(void);
+
+static inline bool task_isolation_possible(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+extern int task_isolation_set(unsigned int flags);
+
+extern bool task_isolation_ready(void);
+extern void task_isolation_enter(void);
+
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags)
+{
+	p->task_isolation_flags = flags;
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+	else
+		clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+extern int task_isolation_syscall(int nr);
+
+/* Report on exceptions that don't cause a signal for the user process. */
+extern void _task_isolation_quiet_exception(const char *fmt, ...);
+#define task_isolation_quiet_exception(fmt, ...)			\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+#else
+static inline void task_isolation_init(void) { }
+static inline bool task_isolation_possible(int cpu) { return false; }
+static inline bool task_isolation_ready(void) { return true; }
+static inline void task_isolation_enter(void) { }
+static inline void task_isolation_set_flags(struct task_struct *p,
+					    unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_quiet_exception(const char *fmt, ...) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..77dc12cd4fe8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1923,6 +1923,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int	task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62be0786d6d0..fbd81e322860 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -235,6 +235,8 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+extern void tick_nohz_full_add_cpus(const struct cpumask *mask);
+extern bool can_stop_my_full_tick(void);
 #else
 static inline int housekeeping_any_cpu(void)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..2a49d0d2940a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,9 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable/disable or query task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_SET_TASK_ISOLATION		48
+#define PR_GET_TASK_ISOLATION		49
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f096050d..a95a35a31b46 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -786,6 +786,33 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	 Allow userspace processes to place themselves on task_isolation
+	 cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	 themselves from the kernel.  Prior to returning to userspace,
+	 isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  By default, attempting to re-enter the kernel
+	 while in this mode will cause the task to be terminated
+	 with a signal; you must explicitly use prctl() to disable
+	 task isolation before resuming normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for
+	 tasks that run hard real-time code in userspace, such as
+	 a 10 Gbit network driver in userspace.  Without this
+	 option, but with NO_HZ_FULL enabled, the kernel will make
+	 a best-effort, "soft" attempt to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e2b952..91ff1615f4d6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d4a866..54542266d7a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,7 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/isolation.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1533,6 +1534,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 	clear_all_latency_tracing(p);
 
+	task_isolation_set_flags(p, 0);
+
 	/* ok, now we should be set up.. */
 	p->pid = pid_nr(pid);
 	if (clone_flags & CLONE_THREAD) {
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..4382e2043de9
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,218 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+cpumask_var_t task_isolation_map;
+static bool saw_boot_arg;
+
+/*
+ * Isolation requires both nohz and isolcpus support from the scheduler.
+ * We provide a boot flag that enables both for now, and which we can
+ * add other functionality to over time if needed.  Note that just
+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
+ */
+static int __init task_isolation_setup(char *str)
+{
+	saw_boot_arg = true;
+
+	alloc_bootmem_cpumask_var(&task_isolation_map);
+	if (cpulist_parse(str, task_isolation_map) < 0) {
+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
+		return 1;
+	}
+
+	return 1;
+}
+__setup("task_isolation=", task_isolation_setup);
+
+int __init task_isolation_init(void)
+{
+	/* For offstack cpumask, ensure we allocate an empty cpumask early. */
+	if (!saw_boot_arg) {
+		zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+		return 0;
+	}
+
+	/*
+	 * Add our task_isolation cpus to nohz_full and isolcpus.  Note
+	 * that we are called relatively early in boot, from tick_init();
+	 * at this point neither nohz_full nor isolcpus has been used
+	 * to configure the system, but isolcpus has been allocated
+	 * already in sched_init().
+	 */
+	tick_nohz_full_add_cpus(task_isolation_map);
+	cpumask_or(cpu_isolated_map, cpu_isolated_map, task_isolation_map);
+
+	return 0;
+}
+
+/*
+ * Get a snapshot of whether, at this moment, it would be possible to
+ * stop the tick.  This test normally requires interrupts disabled since
+ * the condition can change if an interrupt is delivered.  However, in
+ * this case we are using it in an advisory capacity to see if there
+ * is anything obviously indicating that the task isolation
+ * preconditions have not been met, so it's OK that in principle it
+ * might not still be true later in the prctl() syscall path.
+ */
+static bool can_stop_my_full_tick_now(void)
+{
+	bool ret;
+
+	local_irq_disable();
+	ret = can_stop_my_full_tick();
+	local_irq_enable();
+	return ret;
+}
+
+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single task_isolation core, or
+ * else we return EINVAL.  And, it must be at least statically able to
+ * stop the nohz_full tick (e.g., no other schedulable tasks currently
+ * running, no POSIX cpu timers currently set up, etc.); if not, we
+ * return EAGAIN.
+ */
+int task_isolation_set(unsigned int flags)
+{
+	if (flags != 0) {
+		if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
+		    !task_isolation_possible(raw_smp_processor_id())) {
+			/* Invalid task affinity setting. */
+			return -EINVAL;
+		}
+		if (!can_stop_my_full_tick_now()) {
+			/* System not yet ready for task isolation. */
+			return -EAGAIN;
+		}
+	}
+
+	task_isolation_set_flags(current, flags);
+	return 0;
+}
+
+/*
+ * In task isolation mode we try to return to userspace only after
+ * attempting to make sure we won't be interrupted again.  This test
+ * is run with interrupts disabled to test that everything we need
+ * to be true is true before we can return to userspace.
+ */
+bool task_isolation_ready(void)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	return (!lru_add_drain_needed(smp_processor_id()) &&
+		vmstat_idle() &&
+		tick_nohz_tick_stopped());
+}
+
+/*
+ * Each time we try to prepare for return to userspace in a process
+ * with task isolation enabled, we run this code to quiesce whatever
+ * subsystems we can readily quiesce to avoid later interrupts.
+ */
+void task_isolation_enter(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	if (!vmstat_idle())
+		quiet_vmstat_sync();
+
+	/*
+	 * Request rescheduling unless we are in full dynticks mode.
+	 * We would eventually get pre-empted without this, and if
+	 * there's another task waiting, it would run; but by
+	 * explicitly requesting the reschedule, we may reduce the
+	 * latency.  We could directly call schedule() here as well,
+	 * but since our caller is the standard place where schedule()
+	 * is called, we defer to the caller.
+	 *
+	 * A more substantive approach here would be to use a struct
+	 * completion here explicitly, and complete it when we shut
+	 * down dynticks, but since we presumably have nothing better
+	 * to do on this core anyway, just spinning seems plausible.
+	 */
+	if (!tick_nohz_tick_stopped())
+		set_tsk_need_resched(current);
+}
+
+static void task_isolation_deliver_signal(struct task_struct *task,
+					  const char *buf)
+{
+	siginfo_t info = {};
+
+	info.si_signo = SIGKILL;
+
+	/*
+	 * Report on the fact that isolation was violated for the task.
+	 * It may not be the task's fault (e.g. a TLB flush from another
+	 * core) but we are not blaming it, just reporting that it lost
+	 * its isolation status.
+	 */
+	pr_warn("%s/%d: task_isolation mode lost due to %s\n",
+		task->comm, task->pid, buf);
+
+	/* Turn off task isolation mode to avoid further isolation callbacks. */
+	task_isolation_set_flags(task, 0);
+
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_quiet_exception(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	task_isolation_deliver_signal(task, buf);
+}
+
+/*
+ * This routine is called from syscall entry (with the syscall number
+ * passed in), and prevents most syscalls from executing and raises a
+ * signal to notify the process.
+ */
+int task_isolation_syscall(int syscall)
+{
+	char buf[20];
+
+	if (syscall == __NR_prctl ||
+	    syscall == __NR_exit ||
+	    syscall == __NR_exit_group)
+		return 0;
+
+	snprintf(buf, sizeof(buf), "syscall %d", syscall);
+	task_isolation_deliver_signal(current, buf);
+
+	syscall_set_return_value(current, current_pt_regs(),
+					 -ERESTARTNOINTR, -1);
+	return -1;
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc00d08..895f547ff66f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
 #include <linux/compat.h>
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -2213,6 +2214,13 @@ relock:
 		/* Trace actually delivered signals. */
 		trace_signal_deliver(signr, &ksig->info, ka);
 
+		/*
+		 * Disable task isolation when delivering a signal.
+		 * The isolation model requires users to reset task
+		 * isolation from the signal handler if desired.
+		 */
+		task_isolation_set_flags(current, 0);
+
 		if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
 			continue;
 		if (ka->sa.sa_handler != SIG_DFL) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..4df84af425e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2270,6 +2271,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		error = task_isolation_set(arg2);
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 204fdc86863d..a6e29527743e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
+#include <linux/isolation.h>
 
 #include <asm/irq_regs.h>
 
@@ -205,6 +206,11 @@ static bool can_stop_full_tick(struct tick_sched *ts)
 	return true;
 }
 
+bool can_stop_my_full_tick(void)
+{
+	return can_stop_full_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
 static void nohz_full_kick_func(struct irq_work *work)
 {
 	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
@@ -407,30 +413,34 @@ static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
 	return NOTIFY_OK;
 }
 
-static int tick_nohz_init_all(void)
+void tick_nohz_full_add_cpus(const struct cpumask *mask)
 {
-	int err = -1;
+	if (!cpumask_weight(mask))
+		return;
 
-#ifdef CONFIG_NO_HZ_FULL_ALL
-	if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+	if (tick_nohz_full_mask == NULL &&
+	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
-		return err;
+		return;
 	}
-	err = 0;
-	cpumask_setall(tick_nohz_full_mask);
+
+	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, mask);
 	tick_nohz_full_running = true;
-#endif
-	return err;
 }
 
 void __init tick_nohz_init(void)
 {
 	int cpu;
 
-	if (!tick_nohz_full_running) {
-		if (tick_nohz_init_all() < 0)
-			return;
-	}
+	task_isolation_init();
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+	if (!tick_nohz_full_running)
+		tick_nohz_full_add_cpus(cpu_possible_mask);
+#endif
+
+	if (!tick_nohz_full_running)
+		return;
 
 	if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
 		WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-- 
2.7.2


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
@ 2016-08-29 16:33   ` Peter Zijlstra
  2016-08-29 16:40     ` Chris Metcalf
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2016-08-29 16:33 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> +	/*
> +	 * Request rescheduling unless we are in full dynticks mode.
> +	 * We would eventually get pre-empted without this, and if
> +	 * there's another task waiting, it would run; but by
> +	 * explicitly requesting the reschedule, we may reduce the
> +	 * latency.  We could directly call schedule() here as well,
> +	 * but since our caller is the standard place where schedule()
> +	 * is called, we defer to the caller.
> +	 *
> +	 * A more substantive approach here would be to use a struct
> +	 * completion here explicitly, and complete it when we shut
> +	 * down dynticks, but since we presumably have nothing better
> +	 * to do on this core anyway, just spinning seems plausible.
> +	 */
> +	if (!tick_nohz_tick_stopped())
> +		set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:33   ` Peter Zijlstra
@ 2016-08-29 16:40     ` Chris Metcalf
  2016-08-29 16:48       ` Peter Zijlstra
  2016-08-30  7:58       ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-08-29 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>> +	/*
>> +	 * Request rescheduling unless we are in full dynticks mode.
>> +	 * We would eventually get pre-empted without this, and if
>> +	 * there's another task waiting, it would run; but by
>> +	 * explicitly requesting the reschedule, we may reduce the
>> +	 * latency.  We could directly call schedule() here as well,
>> +	 * but since our caller is the standard place where schedule()
>> +	 * is called, we defer to the caller.
>> +	 *
>> +	 * A more substantive approach here would be to use a struct
>> +	 * completion here explicitly, and complete it when we shut
>> +	 * down dynticks, but since we presumably have nothing better
>> +	 * to do on this core anyway, just spinning seems plausible.
>> +	 */
>> +	if (!tick_nohz_tick_stopped())
>> +		set_tsk_need_resched(current);
> This is broken.. and it would be really good if you don't actually need
> to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

Even without that consideration, we really can't be sure we stay in
dynticks mode if we disable the dynamic tick, but then enable interrupts,
and end up taking an interrupt on the way back to userspace, and
it turns the tick back on.  That's why we do it here, where we know
interrupts will stay disabled until we get to userspace.

So if we are doing it here, what else can/should we do?  There really
shouldn't be any other tasks waiting to run at this point, so there's
not a heck of a lot else to do on this core.  We could just spin and
check need_resched and signal status manually instead, but that
seems kind of duplicative of code already done in our caller here.

So... thoughts?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:40     ` Chris Metcalf
@ 2016-08-29 16:48       ` Peter Zijlstra
  2016-08-29 16:53         ` Chris Metcalf
  2016-08-30  7:58       ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2016-08-29 16:48 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> >On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> >>+	/*
> >>+	 * Request rescheduling unless we are in full dynticks mode.
> >>+	 * We would eventually get pre-empted without this, and if
> >>+	 * there's another task waiting, it would run; but by
> >>+	 * explicitly requesting the reschedule, we may reduce the
> >>+	 * latency.  We could directly call schedule() here as well,
> >>+	 * but since our caller is the standard place where schedule()
> >>+	 * is called, we defer to the caller.
> >>+	 *
> >>+	 * A more substantive approach here would be to use a struct
> >>+	 * completion here explicitly, and complete it when we shut
> >>+	 * down dynticks, but since we presumably have nothing better
> >>+	 * to do on this core anyway, just spinning seems plausible.
> >>+	 */
> >>+	if (!tick_nohz_tick_stopped())
> >>+		set_tsk_need_resched(current);
> >This is broken.. and it would be really good if you don't actually need
> >to do this.
> 
> Can you elaborate?  

Naked use of TIF_NEED_RESCHED like this is busted. There is more state
that needs to be poked to keep things consistent / working.



* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:48       ` Peter Zijlstra
@ 2016-08-29 16:53         ` Chris Metcalf
  2016-08-30  7:59           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-08-29 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/29/2016 12:48 PM, Peter Zijlstra wrote:
> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>> +	/*
>>>> +	 * Request rescheduling unless we are in full dynticks mode.
>>>> +	 * We would eventually get pre-empted without this, and if
>>>> +	 * there's another task waiting, it would run; but by
>>>> +	 * explicitly requesting the reschedule, we may reduce the
>>>> +	 * latency.  We could directly call schedule() here as well,
>>>> +	 * but since our caller is the standard place where schedule()
>>>> +	 * is called, we defer to the caller.
>>>> +	 *
>>>> +	 * A more substantive approach here would be to use a struct
>>>> +	 * completion here explicitly, and complete it when we shut
>>>> +	 * down dynticks, but since we presumably have nothing better
>>>> +	 * to do on this core anyway, just spinning seems plausible.
>>>> +	 */
>>>> +	if (!tick_nohz_tick_stopped())
>>>> +		set_tsk_need_resched(current);
>>> This is broken.. and it would be really good if you don't actually need
>>> to do this.
>> Can you elaborate?
> Naked use of TIF_NEED_RESCHED like this is busted. There is more state
> that needs to be poked to keep things consistent / working.

Would it be cleaner to just replace the set_tsk_need_resched() call
with something like:

     set_current_state(TASK_INTERRUPTIBLE);
     schedule();
     __set_current_state(TASK_RUNNING);

or what would you recommend?

Or, as I said, just doing a busy loop here while testing to see
if need_resched or signal had been set?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:40     ` Chris Metcalf
  2016-08-29 16:48       ` Peter Zijlstra
@ 2016-08-30  7:58       ` Peter Zijlstra
  2016-08-30 15:32         ` Chris Metcalf
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2016-08-30  7:58 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
> >On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
> >>+	/*
> >>+	 * Request rescheduling unless we are in full dynticks mode.
> >>+	 * We would eventually get pre-empted without this, and if
> >>+	 * there's another task waiting, it would run; but by
> >>+	 * explicitly requesting the reschedule, we may reduce the
> >>+	 * latency.  We could directly call schedule() here as well,
> >>+	 * but since our caller is the standard place where schedule()
> >>+	 * is called, we defer to the caller.
> >>+	 *
> >>+	 * A more substantive approach here would be to use a struct
> >>+	 * completion here explicitly, and complete it when we shut
> >>+	 * down dynticks, but since we presumably have nothing better
> >>+	 * to do on this core anyway, just spinning seems plausible.
> >>+	 */
> >>+	if (!tick_nohz_tick_stopped())
> >>+		set_tsk_need_resched(current);
> >This is broken.. and it would be really good if you don't actually need
> >to do this.
> 
> Can you elaborate?  We clearly do want to wait until we are in full
> dynticks mode before we return to userspace.
> 
> We could do it just in the prctl() syscall only, but then we lose the
> ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?

> Even without that consideration, we really can't be sure we stay in
> dynticks mode if we disable the dynamic tick, but then enable interrupts,
> and end up taking an interrupt on the way back to userspace, and
> it turns the tick back on.  That's why we do it here, where we know
> interrupts will stay disabled until we get to userspace.

But but but.. task_isolation_enter() is explicitly ran with IRQs
_enabled_!! It even WARNs if they're disabled.

> So if we are doing it here, what else can/should we do?  There really
> shouldn't be any other tasks waiting to run at this point, so there's
> not a heck of a lot else to do on this core.  We could just spin and
> check need_resched and signal status manually instead, but that
> seems kind of duplicative of code already done in our caller here.

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-29 16:53         ` Chris Metcalf
@ 2016-08-30  7:59           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2016-08-30  7:59 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Mon, Aug 29, 2016 at 12:53:30PM -0400, Chris Metcalf wrote:

> Would it be cleaner to just replace the set_tsk_need_resched() call
> with something like:
> 
>     set_current_state(TASK_INTERRUPTIBLE);
>     schedule();
>     __set_current_state(TASK_RUNNING);
> 
> or what would you recommend?

That'll just get you to sleep _forever_...

> Or, as I said, just doing a busy loop here while testing to see
> if need_resched or signal had been set?

Why do you care about need_resched() and or signals? How is that related
to the tick being stopped or not?


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30  7:58       ` Peter Zijlstra
@ 2016-08-30 15:32         ` Chris Metcalf
  2016-08-30 16:30           ` Andy Lutomirski
  2016-09-01 10:06           ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-08-30 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>> +	/*
>>>> +	 * Request rescheduling unless we are in full dynticks mode.
>>>> +	 * We would eventually get pre-empted without this, and if
>>>> +	 * there's another task waiting, it would run; but by
>>>> +	 * explicitly requesting the reschedule, we may reduce the
>>>> +	 * latency.  We could directly call schedule() here as well,
>>>> +	 * but since our caller is the standard place where schedule()
>>>> +	 * is called, we defer to the caller.
>>>> +	 *
>>>> +	 * A more substantive approach here would be to use a struct
>>>> +	 * completion here explicitly, and complete it when we shut
>>>> +	 * down dynticks, but since we presumably have nothing better
>>>> +	 * to do on this core anyway, just spinning seems plausible.
>>>> +	 */
>>>> +	if (!tick_nohz_tick_stopped())
>>>> +		set_tsk_need_resched(current);
>>> This is broken.. and it would be really good if you don't actually need
>>> to do this.
>> Can you elaborate?  We clearly do want to wait until we are in full
>> dynticks mode before we return to userspace.
>>
>> We could do it just in the prctl() syscall only, but then we lose the
>> ability to implement the NOSIG mode, which can be a convenience.
> So this isn't spelled out anywhere. Why does this need to be in the
> return to user path?

I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

>> Even without that consideration, we really can't be sure we stay in
>> dynticks mode if we disable the dynamic tick, but then enable interrupts,
>> and end up taking an interrupt on the way back to userspace, and
>> it turns the tick back on.  That's why we do it here, where we know
>> interrupts will stay disabled until we get to userspace.
> But but but.. task_isolation_enter() is explicitly ran with IRQs
> _enabled_!! It even WARNs if they're disabled.

Yes, true!  But if you pop up to the caller, the key thing is the
task_isolation_ready() routine where we are invoked with interrupts
disabled, and we confirm that all our criteria are met (including
tick_nohz_tick_stopped), and then leave interrupts disabled as we
return from there onwards to userspace.

The task_isolation_enter() code just makes a best-effort attempt to
make sure all these criteria are met, just like all the other TIF_xxx
flag tests do in exit_to_usermode_loop() on x86, like scheduling,
delivering signals, etc.  As you know, we might run that code, go
around the loop, and discover that the TIF flag has been re-set, and
we have to run the code again before all of that stuff has "quiesced".
The isolation code uses that same model; the only difference is that
we clear the TIF flag manually in the loop by checking
task_isolation_ready().

>> So if we are doing it here, what else can/should we do?  There really
>> shouldn't be any other tasks waiting to run at this point, so there's
>> not a heck of a lot else to do on this core.  We could just spin and
>> check need_resched and signal status manually instead, but that
>> seems kind of duplicative of code already done in our caller here.
> What !? I really don't get this, what are you waiting for? Why is
> rescheduling making things better.

We need to wait for the last dyntick to fire before we can return to
userspace.  There are plenty of options as to what we can do in the
meanwhile.

1. Try to schedule().  Good luck with that in practice, since a
userspace process that has enabled task isolation is going to be alone
on its core unless something pretty broken is happening on the system.
But, at least folks understand the idiom of scheduling out while you wait.

2. Another variant of that: set up a wait completion and have the
dynticks code complete it when the tick turns off.  But this adds
complexity to option 1, and really doesn't buy us much in practice
that I can see.

3. Just admit that we are likely alone on the core, and just burn
cycles in a busy loop waiting for that last tick to fire.  Obviously
if we do this we also need to test for signals and resched so the core
remains responsive.  We can either do this in a loop just by spinning
explicitly, or I could literally just remove the line in the current
patchset that sets TIF_NEED_RESCHED, at which point we busy-wait by
just going around and around in exit_to_usermode_loop().  The only
flaw here is that we don't mark the task explicitly as TASK_INTERRUPTIBLE
while we are doing this - and that's probably worth doing.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 15:32         ` Chris Metcalf
@ 2016-08-30 16:30           ` Andy Lutomirski
  2016-08-30 17:02             ` Chris Metcalf
  2016-09-01 10:06           ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-08-30 16:30 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Michal Hocko,
	linux-mm, linux-doc, Linux API, linux-kernel

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>
>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>
>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>
>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>
>>>>> +       /*
>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>> +        * We would eventually get pre-empted without this, and if
>>>>> +        * there's another task waiting, it would run; but by
>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>> +        * but since our caller is the standard place where schedule()
>>>>> +        * is called, we defer to the caller.
>>>>> +        *
>>>>> +        * A more substantive approach here would be to use a struct
>>>>> +        * completion here explicitly, and complete it when we shut
>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>> +        */
>>>>> +       if (!tick_nohz_tick_stopped())
>>>>> +               set_tsk_need_resched(current);
>>>>
>>>> This is broken.. and it would be really good if you don't actually need
>>>> to do this.
>>>
>>> Can you elaborate?  We clearly do want to wait until we are in full
>>> dynticks mode before we return to userspace.
>>>
>>> We could do it just in the prctl() syscall only, but then we lose the
>>> ability to implement the NOSIG mode, which can be a convenience.
>>
>> So this isn't spelled out anywhere. Why does this need to be in the
>> return to user path?
>
>
> I'm not sure where this should be spelled out, to be honest.  I guess
> I can add some commentary to the commit message explaining this part.
>
> The basic idea is just that we don't want to be at risk from the
> dyntick getting enabled.  Similarly, we don't want to be at risk of a
> later global IPI due to lru_add_drain stuff, for example.  And, we may
> want to add additional stuff, like catching kernel TLB flushes and
> deferring them when a remote core is in userspace.  To do all of this
> kind of stuff, we need to run in the return to user path so we are
> late enough to guarantee no further kernel things will happen to
> perturb our carefully-arranged isolation state that includes dyntick
> off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.


* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 16:30           ` Andy Lutomirski
@ 2016-08-30 17:02             ` Chris Metcalf
  2016-08-30 18:43               ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-08-30 17:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
	Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
	Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
	Viresh Kumar, Catalin Marinas, Will Deacon, Michal Hocko,
	linux-mm, linux-doc, Linux API, linux-kernel

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>> +       /*
>>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>>> +        * We would eventually get pre-empted without this, and if
>>>>>> +        * there's another task waiting, it would run; but by
>>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>>> +        * but since our caller is the standard place where schedule()
>>>>>> +        * is called, we defer to the caller.
>>>>>> +        *
>>>>>> +        * A more substantive approach here would be to use a struct
>>>>>> +        * completion here explicitly, and complete it when we shut
>>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>>> +        */
>>>>>> +       if (!tick_nohz_tick_stopped())
>>>>>> +               set_tsk_need_resched(current);
>>>>> This is broken.. and it would be really good if you don't actually need
>>>>> to do this.
>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>> dynticks mode before we return to userspace.
>>>>
>>>> We could do it just in the prctl() syscall only, but then we lose the
>>>> ability to implement the NOSIG mode, which can be a convenience.
>>> So this isn't spelled out anywhere. Why does this need to be in the
>>> return to user path?
>>
>> I'm not sure where this should be spelled out, to be honest.  I guess
>> I can add some commentary to the commit message explaining this part.
>>
>> The basic idea is just that we don't want to be at risk from the
>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>> want to add additional stuff, like catching kernel TLB flushes and
>> deferring them when a remote core is in userspace.  To do all of this
>> kind of stuff, we need to run in the return to user path so we are
>> late enough to guarantee no further kernel things will happen to
>> perturb our carefully-arranged isolation state that includes dyntick
>> off, per-cpu lru cache empty, etc etc.
> None of the above should need to *loop*, though, AFAIK.

Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.
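
To make that concrete, the loop shape I have in mind is roughly the
following - a simplified sketch rather than the literal arch code, and
all the helper names here are placeholders:

        local_irq_disable();
        while (exit_work_pending()) {   /* TIF flags, or isolation not ready */
                local_irq_enable();
                if (need_resched())
                        schedule();
                if (signal_pending(current))
                        do_signal_work();
                /* run last, so every prerequisite gets rechecked: */
                if (task_isolation_enabled())
                        task_isolation_enter();  /* drain lru, flush vmstat,
                                                    wait for the tick; may
                                                    enable irqs and sleep */
                local_irq_disable();
        }
        /* interrupts stay off from here to the return to userspace */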

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 17:02             ` Chris Metcalf
@ 2016-08-30 18:43               ` Andy Lutomirski
  2016-08-30 19:37                 ` Chris Metcalf
  2016-09-30 16:59                 ` Chris Metcalf
  0 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-08-30 18:43 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>>>
>>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>>>
>>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>>>
>>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>>>
>>>>>>> +       /*
>>>>>>> +        * Request rescheduling unless we are in full dynticks mode.
>>>>>>> +        * We would eventually get pre-empted without this, and if
>>>>>>> +        * there's another task waiting, it would run; but by
>>>>>>> +        * explicitly requesting the reschedule, we may reduce the
>>>>>>> +        * latency.  We could directly call schedule() here as well,
>>>>>>> +        * but since our caller is the standard place where schedule()
>>>>>>> +        * is called, we defer to the caller.
>>>>>>> +        *
>>>>>>> +        * A more substantive approach here would be to use a struct
>>>>>>> +        * completion here explicitly, and complete it when we shut
>>>>>>> +        * down dynticks, but since we presumably have nothing better
>>>>>>> +        * to do on this core anyway, just spinning seems plausible.
>>>>>>> +        */
>>>>>>> +       if (!tick_nohz_tick_stopped())
>>>>>>> +               set_tsk_need_resched(current);
>>>>>>
>>>>>> This is broken.. and it would be really good if you don't actually need
>>>>>> to do this.
>>>>>
>>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>>> dynticks mode before we return to userspace.
>>>>>
>>>>> We could do it just in the prctl() syscall only, but then we lose the
>>>>> ability to implement the NOSIG mode, which can be a convenience.
>>>>
>>>> So this isn't spelled out anywhere. Why does this need to be in the
>>>> return to user path?
>>>
>>>
>>> I'm not sure where this should be spelled out, to be honest.  I guess
>>> I can add some commentary to the commit message explaining this part.
>>>
>>> The basic idea is just that we don't want to be at risk from the
>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>> want to add additional stuff, like catching kernel TLB flushes and
>>> deferring them when a remote core is in userspace.  To do all of this
>>> kind of stuff, we need to run in the return to user path so we are
>>> late enough to guarantee no further kernel things will happen to
>>> perturb our carefully-arranged isolation state that includes dyntick
>>> off, per-cpu lru cache empty, etc etc.
>>
>> None of the above should need to *loop*, though, AFAIK.
>
>
> Ordering is a problem, though.
>
> We really want to run task isolation last, so we can guarantee that
> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
> cache empty, etc).  But achieving that state can require enabling
> interrupts - most obviously if we have to schedule, e.g. for vmstat
> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
> just while waiting for that last dyntick interrupt to occur.  I'm also
> not sure that even something as simple as draining the per-cpu lru
> cache can be done holding interrupts disabled throughout - certainly
> there's a !SMP code path there that just re-enables interrupts
> unconditionally, which gives me pause.
>
> At any rate at that point you need to retest for signals, resched,
> etc, all as usual, and then you need to recheck the task isolation
> prerequisites once more.
>
> I may be missing something here, but it's really not obvious to me
> that there's a way to do this without having task isolation integrated
> into the usual return-to-userspace loop.
>

What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.
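
Concretely, something like this, maybe (all the names here are made
up; the real checks would live in the vmstat and lru code):

        static DEFINE_PER_CPU(bool, quiescing);

        /* on the way out to userspace for an isolated task: */
        this_cpu_write(quiescing, true);

        /* on the next kernel entry: */
        this_cpu_write(quiescing, false);

        /* and then, e.g., in the vmstat update path: */
        if (this_cpu_read(quiescing))
                fold_diffs_to_global_counters();  /* made-up name */
        else
                defer_to_vmstat_worker();         /* made-up name */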

Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?

--Andy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 18:43               ` Andy Lutomirski
@ 2016-08-30 19:37                 ` Chris Metcalf
  2016-08-30 19:50                   ` Andy Lutomirski
  2016-09-30 16:59                 ` Chris Metcalf
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-08-30 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>> The basic idea is just that we don't want to be at risk from the
>>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>>> want to add additional stuff, like catching kernel TLB flushes and
>>>> deferring them when a remote core is in userspace.  To do all of this
>>>> kind of stuff, we need to run in the return to user path so we are
>>>> late enough to guarantee no further kernel things will happen to
>>>> perturb our carefully-arranged isolation state that includes dyntick
>>>> off, per-cpu lru cache empty, etc etc.
>>> None of the above should need to *loop*, though, AFAIK.
>> Ordering is a problem, though.
>>
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc).  But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur.  I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
>>
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace.  Then, on the next entry, clear
> that flag.  With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.

This is an interesting idea!

However, there are a number of implementation issues that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Then, there are the two actual subsystems in question.  It looks like
we could intercept LRU reasonably cleanly by hooking pagevec_add()
to return zero when we are in this "going quiescent" mode, and that
would keep the per-cpu vectors empty.  The vmstat stuff is a little
trickier since all the existing code is built around updating the per-cpu
stuff and then only later copying it off to the global state.  I suppose
we could add a test-and-flush at the end of every public API and not
worry about the implementation cost.
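
For the lru half, the hook could be as simple as this (pagevec_space()
is the existing helper; task_isolation_quiescing() is the hypothetical
"going quiescent" test):

        static inline unsigned pagevec_add(struct pagevec *pvec,
                                           struct page *page)
        {
                pvec->pages[pvec->nr++] = page;
                /* claim to be full while quiescing, so callers drain
                 * straight to the lru instead of batching: */
                if (task_isolation_quiescing())
                        return 0;
                return pagevec_space(pvec);
        }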

But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this.  My guess
is that it is easier to support the kind of "are you clean?" / "get clean"
APIs for subsystems, rather than weaving a whole set of "stay clean"
mechanism into each subsystem.

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

> Also, this cond_resched stuff doesn't worry me too much at a
> fundamental level -- if we're really going quiescent, shouldn't we be
> able to arrange that there are no other schedulable tasks on the CPU
> in question?

We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.)  But I'm reluctant to guarantee
every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if that
sort of thing happens.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 19:37                 ` Chris Metcalf
@ 2016-08-30 19:50                   ` Andy Lutomirski
  2016-09-02 14:04                     ` Chris Metcalf
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-08-30 19:50 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>
>> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
>>>>
>>>> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@mellanox.com>
>>>> wrote:
>>>>>
>>>>> The basic idea is just that we don't want to be at risk from the
>>>>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>>>>> later global IPI due to lru_add_drain stuff, for example.  And, we may
>>>>> want to add additional stuff, like catching kernel TLB flushes and
>>>>> deferring them when a remote core is in userspace.  To do all of this
>>>>> kind of stuff, we need to run in the return to user path so we are
>>>>> late enough to guarantee no further kernel things will happen to
>>>>> perturb our carefully-arranged isolation state that includes dyntick
>>>>> off, per-cpu lru cache empty, etc etc.
>>>>
>>>> None of the above should need to *loop*, though, AFAIK.
>>>
>>> Ordering is a problem, though.
>>>
>>> We really want to run task isolation last, so we can guarantee that
>>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>>> cache empty, etc).  But achieving that state can require enabling
>>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>>> just while waiting for that last dyntick interrupt to occur.  I'm also
>>> not sure that even something as simple as draining the per-cpu lru
>>> cache can be done holding interrupts disabled throughout - certainly
>>> there's a !SMP code path there that just re-enables interrupts
>>> unconditionally, which gives me pause.
>>>
>>> At any rate at that point you need to retest for signals, resched,
>>> etc, all as usual, and then you need to recheck the task isolation
>>> prerequisites once more.
>>>
>>> I may be missing something here, but it's really not obvious to me
>>> that there's a way to do this without having task isolation integrated
>>> into the usual return-to-userspace loop.
>>>
>> What if we did it the other way around: set a percpu flag saying
>> "going quiescent; disallow new deferred work", then finish all
>> existing work and return to userspace.  Then, on the next entry, clear
>> that flag.  With the flag set, vmstat would just flush anything that
>> it accumulates immediately, nothing would be added to the LRU list,
>> etc.
>
>
> This is an interesting idea!
>
> However, there are a number of implementation issues that make me
> worry that it might be a trickier approach overall.
>
> First, "on the next entry" hides a world of hurt in four simple words.
> Some platforms (arm64 and tile, that I'm familiar with) have a common
> chunk of code that always runs on every entry to the kernel.  It would
> not be too hard to poke at the assembly and make those platforms
> always run some task-isolation specific code on entry.  But x86 scares
> me - there seem to be a whole lot of ways to get into the kernel, and
> I'm not convinced there is a lot of shared macrology or whatever that
> would make it straightforward to intercept all of them.

Just use the context tracking entry hook.  It's 100% reliable.  The
relevant x86 function is enter_from_user_mode(), but I would just hook
into user_exit() in the common code.  (This code had better be
reliable, because context tracking depends on it, and, if context
tracking doesn't work on a given arch, then isolation isn't going to
work regardless.)
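
I.e., roughly the following, where task_isolation_kernel_enter() is
the hypothetical hook (the context tracking call is what is already
there today):

        /* common code, run on every entry from userspace: */
        static inline void user_exit(void)
        {
                if (context_tracking_is_enabled())
                        context_tracking_exit(CONTEXT_USER);
                /* new: leave the "quiescing" state */
                task_isolation_kernel_enter();
        }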

>
> Then, there are the two actual subsystems in question.  It looks like
> we could intercept LRU reasonably cleanly by hooking pagevec_add()
> to return zero when we are in this "going quiescent" mode, and that
> would keep the per-cpu vectors empty.  The vmstat stuff is a little
> trickier since all the existing code is built around updating the per-cpu
> stuff and then only later copying it off to the global state.  I suppose
> we could add a test-and-flush at the end of every public API and not
> worry about the implementation cost.

Seems reasonable to me.  If anyone cares about the performance hit,
they can fix it.

>
> But it does seem like we are adding noticeable maintenance cost on
> the mainline kernel to support task isolation by doing this.  My guess
> is that it is easier to support the kind of "are you clean?" / "get clean"
> APIs for subsystems, rather than weaving a whole set of "stay clean"
> mechanism into each subsystem.

My intuition is that it's the other way around.  For the mainline
kernel, having a nice clean well-integrated implementation is nicer
than having a bolted-on implementation that interacts in potentially
complicated ways.  Once quiescence support is in mainline, the size of
the diff or the degree to which it's scattered around is irrelevant
because it's not a diff any more.

>
> So to pop up a level, what is your actual concern about the existing
> "do it in a loop" model?  The macrology currently in use means there
> is zero cost if you don't configure TASK_ISOLATION, and the software
> maintenance cost seems low since the idioms used for task isolation
> in the loop are generally familiar to people reading that code.

My concern is that it's not obvious to readers of the code that the
loop ever terminates.  It really ought to, but it's doing something
very odd.  Normally we can loop because we get scheduled out, but
actually blocking in the return-to-userspace path, especially blocking
on a condition that doesn't have a wakeup associated with it, is odd.

>
>> Also, this cond_resched stuff doesn't worry me too much at a
>> fundamental level -- if we're really going quiescent, shouldn't we be
>> able to arrange that there are no other schedulable tasks on the CPU
>> in question?
>
>
> We aren't currently planning to enforce things in the scheduler, so if
> the application affinitizes another task on top of an existing task
> isolation task, by default the task isolation task just dies. (Unless
> it's using NOSIG mode, in which case it just ends up stuck in the
> kernel trying to wait out the dyntick until you either kill it, or
> re-affinitize the offending task.)  But I'm reluctant to guarantee
> every possible way that you might (perhaps briefly) have some
> schedulable task, and the current approach seems pretty robust if that
> sort of thing happens.
>

This kind of waiting out the dyntick scares me.  Why is there ever a
dyntick that you're waiting out?  If quiescence is to be a supported
mainline feature, shouldn't the scheduler be integrated well enough
with it that you don't need to wait like this?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 15:32         ` Chris Metcalf
  2016-08-30 16:30           ` Andy Lutomirski
@ 2016-09-01 10:06           ` Peter Zijlstra
  2016-09-02 14:03             ` Chris Metcalf
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2016-09-01 10:06 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Tue, Aug 30, 2016 at 11:32:16AM -0400, Chris Metcalf wrote:
> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

> >What !? I really don't get this, what are you waiting for? Why is
> >rescheduling making things better.
> 
> We need to wait for the last dyntick to fire before we can return to
> userspace.  There are plenty of options as to what we can do in the
> meanwhile.

Why not keep your _TIF_TASK_ISOLATION_FOO flag set and re-enter the
loop?

I really don't see how setting TIF_NEED_RESCHED is helping anything.
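
That is, something like this in the exit loop (a sketch; apart from
the flag, the helper names are made up):

        if (cached_flags & _TIF_TASK_ISOLATION_FOO) {
                task_isolation_work();          /* one pass of the work */
                if (task_isolation_done())      /* tick stopped etc.? */
                        clear_thread_flag(TIF_TASK_ISOLATION_FOO);
                /* otherwise the flag stays set and the loop re-runs */
        }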


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-01 10:06           ` Peter Zijlstra
@ 2016-09-02 14:03             ` Chris Metcalf
  2016-09-02 16:40               ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-09-02 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On 9/1/2016 6:06 AM, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 11:32:16AM -0400, Chris Metcalf wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> What !? I really don't get this, what are you waiting for? Why is
>>> rescheduling making things better.
>> We need to wait for the last dyntick to fire before we can return to
>> userspace.  There are plenty of options as to what we can do in the
>> meanwhile.
> Why not keep your _TIF_TASK_ISOLATION_FOO flag set and re-enter the
> loop?
>
> I really don't see how setting TIF_NEED_RESCHED is helping anything.

Yes, I think I addressed that in an earlier reply to Frederic; and you're right,
I don't think TIF_NEED_RESCHED or schedule() are the way to go.

https://lkml.kernel.org/g/107bd666-dbcf-7fa5-ff9c-f79358899712@mellanox.com

Any thoughts on the question of "just re-enter the loop" vs. schedule_timeout()?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 19:50                   ` Andy Lutomirski
@ 2016-09-02 14:04                     ` Chris Metcalf
  2016-09-02 17:28                       ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-09-02 14:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>> What if we did it the other way around: set a percpu flag saying
>>> "going quiescent; disallow new deferred work", then finish all
>>> existing work and return to userspace.  Then, on the next entry, clear
>>> that flag.  With the flag set, vmstat would just flush anything that
>>> it accumulates immediately, nothing would be added to the LRU list,
>>> etc.
>>
>> This is an interesting idea!
>>
>> However, there are a number of implementation issues that make me
>> worry that it might be a trickier approach overall.
>>
>> First, "on the next entry" hides a world of hurt in four simple words.
>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>> chunk of code that always runs on every entry to the kernel.  It would
>> not be too hard to poke at the assembly and make those platforms
>> always run some task-isolation specific code on entry.  But x86 scares
>> me - there seem to be a whole lot of ways to get into the kernel, and
>> I'm not convinced there is a lot of shared macrology or whatever that
>> would make it straightforward to intercept all of them.
> Just use the context tracking entry hook.  It's 100% reliable.  The
> relevant x86 function is enter_from_user_mode(), but I would just hook
> into user_exit() in the common code.  (This code had better be
> reliable, because context tracking depends on it, and, if context
> tracking doesn't work on a given arch, then isolation isn't going to
> work regardless.)

This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
we could do an entry-point approach plausibly now.

This is also good for when we want to look at deferring the kernel TLB flush,
since it's the same mechanism that would be required for that.

>> But it does seem like we are adding noticeable maintenance cost on
>> the mainline kernel to support task isolation by doing this.  My guess
>> is that it is easier to support the kind of "are you clean?" / "get clean"
>> APIs for subsystems, rather than weaving a whole set of "stay clean"
>> mechanism into each subsystem.
> My intuition is that it's the other way around.  For the mainline
> kernel, having a nice clean well-integrated implementation is nicer
> than having a bolted-on implementation that interacts in potentially
> complicated ways.  Once quiescence support is in mainline, the size of
> the diff or the degree to which it's scattered around is irrelevant
> because it's not a diff any more.

I'm not concerned with the size of the diff, just with the intrusiveness into
the various subsystems.

That said, code talks, so let me take a swing at doing it the way you suggest
for vmstat/lru and we'll see what it looks like.

>> So to pop up a level, what is your actual concern about the existing
>> "do it in a loop" model?  The macrology currently in use means there
>> is zero cost if you don't configure TASK_ISOLATION, and the software
>> maintenance cost seems low since the idioms used for task isolation
>> in the loop are generally familiar to people reading that code.
> My concern is that it's not obvious to readers of the code that the
> loop ever terminates.  It really ought to, but it's doing something
> very odd.  Normally we can loop because we get scheduled out, but
> actually blocking in the return-to-userspace path, especially blocking
> on a condition that doesn't have a wakeup associated with it, is odd.

True, although, comments :-)

Regardless, though, this doesn't seem at all weird to me in the
context of the vmstat and lru stuff.  It's exactly parallel to
the fact that we loop around on checking need_resched and signal, and
in some cases you could imagine multiple loops around when we schedule
out and get a signal, so loop around again, and then another
reschedule event happens during signal processing so we go around
again, etc.  Eventually it settles down.  It's the same with the
vmstat/lru stuff.

>>> Also, this cond_resched stuff doesn't worry me too much at a
>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>> able to arrange that there are no other schedulable tasks on the CPU
>>> in question?
>> We aren't currently planning to enforce things in the scheduler, so if
>> the application affinitizes another task on top of an existing task
>> isolation task, by default the task isolation task just dies. (Unless
>> it's using NOSIG mode, in which case it just ends up stuck in the
>> kernel trying to wait out the dyntick until you either kill it, or
>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>> every possible way that you might (perhaps briefly) have some
>> schedulable task, and the current approach seems pretty robust if that
>> sort of thing happens.
> This kind of waiting out the dyntick scares me.  Why is there ever a
> dyntick that you're waiting out?  If quiescence is to be a supported
> mainline feature, shouldn't the scheduler be integrated well enough
> with it that you don't need to wait like this?

Well, this is certainly the funkiest piece of the task isolation
stuff.  The problem is that the dyntick stuff may, for example, need
one more tick 4us from now (or whatever) just to close out the current
RCU period.  We can't return to userspace until that happens.  So what
else can we do when the task is ready to return to userspace?  We
could punt into the idle task instead of waiting in this task, which
was my earlier schedule_timeout() suggestion.  Do you think that's cleaner?

> Have you confirmed that this works correctly wrt PTRACE_SYSCALL?  It
> should result in an even number of events (like raise(2) or an async
> signal) and that should have a test case.

I have not tested PTRACE_SYSCALL.  I'll see about adding that to the
selftest code.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 14:03             ` Chris Metcalf
@ 2016-09-02 16:40               ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2016-09-02 16:40 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	linux-mm, linux-doc, linux-api, linux-kernel

On Fri, Sep 02, 2016 at 10:03:52AM -0400, Chris Metcalf wrote:
> Any thoughts on the question of "just re-enter the loop" vs. schedule_timeout()?

schedule_timeout() should only be used for things you do not have
control over, like things outside of the machine.

If you want to actually block running, use that completion you were
talking of.
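
Something like this, say (a sketch; it assumes a wakeup can be added
where the tick actually stops, and each per-cpu completion would need
initializing at boot):

        static DEFINE_PER_CPU(struct completion, tick_stopped);

        /* in the return-to-user path, instead of spinning: */
        while (!tick_nohz_tick_stopped())
                wait_for_completion(this_cpu_ptr(&tick_stopped));

        /* in the dynticks code, once the tick is really stopped: */
        complete(this_cpu_ptr(&tick_stopped));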


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 14:04                     ` Chris Metcalf
@ 2016-09-02 17:28                       ` Andy Lutomirski
  2016-09-09 17:40                         ` Chris Metcalf
  2016-09-27 14:22                         ` Frederic Weisbecker
  0 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2016-09-02 17:28 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, linux-doc, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Viresh Kumar, Linux API,
	Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>
>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>
>>>> What if we did it the other way around: set a percpu flag saying
>>>> "going quiescent; disallow new deferred work", then finish all
>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>> etc.
>>>
>>>
>>> This is an interesting idea!
>>>
>>> However, there are a number of implementation issues that make me
>>> worry that it might be a trickier approach overall.
>>>
>>> First, "on the next entry" hides a world of hurt in four simple words.
>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>> chunk of code that always runs on every entry to the kernel.  It would
>>> not be too hard to poke at the assembly and make those platforms
>>> always run some task-isolation specific code on entry.  But x86 scares
>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>> I'm not convinced there is a lot of shared macrology or whatever that
>>> would make it straightforward to intercept all of them.
>>
>> Just use the context tracking entry hook.  It's 100% reliable.  The
>> relevant x86 function is enter_from_user_mode(), but I would just hook
>> into user_exit() in the common code.  (This code had better be
>> reliable, because context tracking depends on it, and, if context
>> tracking doesn't work on a given arch, then isolation isn't going to
>> work regardless.)
>
>
> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
> we could do an entry-point approach plausibly now.
>
> This is also good for when we want to look at deferring the kernel TLB flush,
> since it's the same mechanism that would be required for that.
>
>

There's at least one gotcha for the latter: NMIs aren't currently
guaranteed to go through context tracking.  Instead they use their own
RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
more care will be needed.  I would probably approach it with an
additional NMI hook in the same places as rcu_nmi_enter() that does,
more or less:

if (need_tlb_flush) flush();

and then make sure that the normal exit hook looks like:

if (need_tlb_flush) {
  flush();
  barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
been flushed */
  need_tlb_flush = false;
}

>>> But it does seem like we are adding noticeable maintenance cost on
>>> the mainline kernel to support task isolation by doing this.  My guess
>>> is that it is easier to support the kind of "are you clean?" / "get clean"
>>> APIs for subsystems, rather than weaving a whole set of "stay clean"
>>> mechanism into each subsystem.
>>
>> My intuition is that it's the other way around.  For the mainline
>> kernel, having a nice clean well-integrated implementation is nicer
>> than having a bolted-on implementation that interacts in potentially
>> complicated ways.  Once quiescence support is in mainline, the size of
>> the diff or the degree to which it's scattered around is irrelevant
>> because it's not a diff any more.
>
>
> I'm not concerned with the size of the diff, just with the intrusiveness into
> the various subsystems.
>
> That said, code talks, so let me take a swing at doing it the way you suggest
> for vmstat/lru and we'll see what it looks like.

Thanks :)

>
>
>>> So to pop up a level, what is your actual concern about the existing
>>> "do it in a loop" model?  The macrology currently in use means there
>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>> maintenance cost seems low since the idioms used for task isolation
>>> in the loop are generally familiar to people reading that code.
>>
>> My concern is that it's not obvious to readers of the code that the
>> loop ever terminates.  It really ought to, but it's doing something
>> very odd.  Normally we can loop because we get scheduled out, but
>> actually blocking in the return-to-userspace path, especially blocking
>> on a condition that doesn't have a wakeup associated with it, is odd.
>
>
> True, although, comments :-)
>
> Regardless, though, this doesn't seem at all weird to me in the
> context of the vmstat and lru stuff.  It's exactly parallel to
> the fact that we loop around on checking need_resched and signal, and
> in some cases you could imagine multiple loops around when we schedule
> out and get a signal, so loop around again, and then another
> reschedule event happens during signal processing so we go around
> again, etc.  Eventually it settles down.  It's the same with the
> vmstat/lru stuff.

Only kind of.

When we say, effectively, while (need_resched()) schedule();, we're
not waiting for an event or condition per se.  We're runnable (in the
sense that userspace wants to run and we're not blocked on anything)
the entire time -- we're simply yielding to some other thread that is
also runnable.  So if that loop runs forever, it either means that
we're at low priority and we genuinely shouldn't be running or that
there's a scheduler bug.

If, on the other hand, we say while (not quiesced) schedule(); (or
equivalent), we're saying that we're *not* really ready to run and
that we're waiting for some condition to change.  The condition in
question is fairly complicated and won't wake us when we are ready.  I
can also imagine the scheduler getting rather confused, since, as far
as the scheduler knows, we are runnable and we are supposed to be
running.

>
>
>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>> in question?
>>>
>>> We aren't currently planning to enforce things in the scheduler, so if
>>> the application affinitizes another task on top of an existing task
>>> isolation task, by default the task isolation task just dies. (Unless
>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>> kernel trying to wait out the dyntick until you either kill it, or
>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>> every possible way that you might (perhaps briefly) have some
>>> schedulable task, and the current approach seems pretty robust if that
>>> sort of thing happens.
>>
>> This kind of waiting out the dyntick scares me.  Why is there ever a
>> dyntick that you're waiting out?  If quiescence is to be a supported
>> mainline feature, shouldn't the scheduler be integrated well enough
>> with it that you don't need to wait like this?
>
>
> Well, this is certainly the funkiest piece of the task isolation
> stuff.  The problem is that the dyntick stuff may, for example, need
> one more tick 4us from now (or whatever) just to close out the current
> RCU period.  We can't return to userspace until that happens.  So what
> else can we do when the task is ready to return to userspace?  We
> could punt into the idle task instead of waiting in this task, which
> was my earlier schedule_timeout() suggestion.  Do you think that's cleaner?
>

Unless I'm missing something (which is reasonably likely), couldn't
the isolation code just force or require rcu_nocbs on the isolated
CPUs to avoid this problem entirely?

I admit I still don't understand why the RCU context tracking code
can't just run the callback right away instead of waiting however many
microseconds in general.  I feel like paulmck has explained it to me
at least once, but that doesn't mean I remember the answer.

--Andy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 17:28                       ` Andy Lutomirski
@ 2016-09-09 17:40                         ` Chris Metcalf
  2016-09-12 17:41                           ` Andy Lutomirski
  2016-09-27 14:22                         ` Frederic Weisbecker
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Metcalf @ 2016-09-09 17:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-doc, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Viresh Kumar, Linux API,
	Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>> What if we did it the other way around: set a percpu flag saying
>>>>> "going quiescent; disallow new deferred work", then finish all
>>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>>> etc.
>>>>
>>>> This is an interesting idea!
>>>>
>>>> However, there are a number of implementation issues that make me
>>>> worry that it might be a trickier approach overall.
>>>>
>>>> First, "on the next entry" hides a world of hurt in four simple words.
>>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>>> chunk of code that always runs on every entry to the kernel.  It would
>>>> not be too hard to poke at the assembly and make those platforms
>>>> always run some task-isolation specific code on entry.  But x86 scares
>>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>>> I'm not convinced there is a lot of shared macrology or whatever that
>>>> would make it straightforward to intercept all of them.
>>> Just use the context tracking entry hook.  It's 100% reliable.  The
>>> relevant x86 function is enter_from_user_mode(), but I would just hook
>>> into user_exit() in the common code.  (This code had better be
>>> reliable, because context tracking depends on it, and, if context
>>> tracking doesn't work on a given arch, then isolation isn't going to
>>> work regardless.)
>>
>> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
>> we could do an entry-point approach plausibly now.
>>
>> This is also good for when we want to look at deferring the kernel TLB flush,
>> since it's the same mechanism that would be required for that.
>>
>>
> There's at least one gotcha for the latter: NMIs aren't currently
> guaranteed to go through context tracking.  Instead they use their own
> RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
> more care will be needed.  I would probably approach it with an
> additional NMI hook in the same places as rcu_nmi_enter() that does,
> more or less:
>
> if (need_tlb_flush) flush();
>
> and then make sure that the normal exit hook looks like:
>
> if (need_tlb_flush) {
>    flush();
>    barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
> been flushed */
>    need_tlb_flush = false;
> }

This is a good point.  For now I will continue not trying to include the TLB flush
in the current patch series, so I will sit on this until we're ready to do so.

>>>> So to pop up a level, what is your actual concern about the existing
>>>> "do it in a loop" model?  The macrology currently in use means there
>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>> maintenance cost seems low since the idioms used for task isolation
>>>> in the loop are generally familiar to people reading that code.
>>> My concern is that it's not obvious to readers of the code that the
>>> loop ever terminates.  It really ought to, but it's doing something
>>> very odd.  Normally we can loop because we get scheduled out, but
>>> actually blocking in the return-to-userspace path, especially blocking
>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>
>> True, although, comments :-)
>>
>> Regardless, though, this doesn't seem at all weird to me in the
>> context of the vmstat and lru stuff.  It's exactly parallel to
>> the fact that we loop around on checking need_resched and signal, and
>> in some cases you could imagine multiple loops around when we schedule
>> out and get a signal, so loop around again, and then another
>> reschedule event happens during signal processing so we go around
>> again, etc.  Eventually it settles down.  It's the same with the
>> vmstat/lru stuff.
> Only kind of.
>
> When we say, effectively, while (need_resched()) schedule();, we're
> not waiting for an event or condition per se.  We're runnable (in the
> sense that userspace wants to run and we're not blocked on anything)
> the entire time -- we're simply yielding to some other thread that is
> also runnable.  So if that loop runs forever, it either means that
> we're at low priority and we genuinely shouldn't be running or that
> there's a scheduler bug.
>
> If, on the other hand, we say while (not quiesced) schedule(); (or
> equivalent), we're saying that we're *not* really ready to run and
> that we're waiting for some condition to change.  The condition in
> question is fairly complicated and won't wake us when we are ready.  I
> can also imagine the scheduler getting rather confused, since, as far
> as the scheduler knows, we are runnable and we are supposed to be
> running.

So, how about a code structure like this?

In the main return-to-userspace loop where we check TIF flags,
we keep the notion of our TIF_TASK_ISOLATION flag that causes
us to invoke a task_isolation_prepare() routine.  This routine
does the following things:

1. As you suggested, set a new TIF bit (or equivalent) that says the
system should no longer create deferred work on this core, and then
flush any necessary already-deferred work (currently, the LRU cache
and the vmstat stuff).  We never have to go flush the deferred work
again during this task's return to userspace.  Note that this bit can
only be set on a core marked for task isolation, so it can't be used
for denial of service type attacks on normal cores that are trying to
multitask normal Linux processes.

2. Check if the dyntick is stopped, and if not, wait on a completion
that will be set when it does stop.  This means we may schedule out at
this point, but when we return, the deferred work stuff is still safe
since your bit is still set, and in principle the dyntick is
stopped.

Then, after we disable interrupts and re-read the thread-info flags,
we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
set that would keep us in the loop.  This will always end up happening
on each return to userspace, since the only thing that actually clears
the bit is a prctl() call.  When that happens we know we are about to
return to userspace, so we call task_isolation_ready(), which now has
two things to do:

1. We check that the dyntick is in fact stopped, since it's possible
that a race condition led to it being somehow restarted by an interrupt.
If it is not stopped, we go around the loop again so we can go back in
to the completion discussed above and wait some more.  This may merit
a WARN_ON or other notice since it seems like people aren't convinced
there are things that could restart it, but honestly the dyntick stuff
is complex enough that I think a belt-and-suspenders kind of test here
at the last minute is just defensive programming.

2. Assuming it's stopped, we clear your bit at this point, and
return "true" so the loop code knows to break out of the loop and do
the actual return to userspace.  Clearing the bit at this point is
better than waiting until we re-enter the kernel later, since it
avoids having to figure out all the ways we actually can re-enter.
With interrupts disabled, and this late in the return to userspace
process, there's no way additional deferred work can be created.
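
In pseudo-C the whole flow would look something like this (a sketch
only; TIF_NO_DEFER, wait_for_tick_stop() and friends are placeholder
names, not code from the series):

        /* called from the TIF-flags loop, irqs enabled; may sleep: */
        void task_isolation_prepare(void)
        {
                /* 1: stop new deferred work, flush what's pending */
                if (!test_and_set_thread_flag(TIF_NO_DEFER)) {
                        lru_add_drain();        /* empty per-cpu lru cache */
                        quiet_vmstat_sync();    /* flush vmstat diffs */
                }
                /* 2: block until the last tick has fired */
                if (!tick_nohz_tick_stopped())
                        wait_for_tick_stop();   /* completion-based */
        }

        /* called with irqs off, just before returning to userspace: */
        bool task_isolation_ready(void)
        {
                if (WARN_ON_ONCE(!tick_nohz_tick_stopped()))
                        return false;           /* loop and wait again */
                clear_thread_flag(TIF_NO_DEFER);
                return true;
        }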

>>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>>> in question?
>>>> We aren't currently planning to enforce things in the scheduler, so if
>>>> the application affinitizes another task on top of an existing task
>>>> isolation task, by default the task isolation task just dies. (Unless
>>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>>> kernel trying to wait out the dyntick until you either kill it, or
>>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>>> every possible way that you might (perhaps briefly) have some
>>>> schedulable task, and the current approach seems pretty robust if that
>>>> sort of thing happens.
>>> This kind of waiting out the dyntick scares me.  Why is there ever a
>>> dyntick that you're waiting out?  If quiescence is to be a supported
>>> mainline feature, shouldn't the scheduler be integrated well enough
>>> with it that you don't need to wait like this?
>>
>> Well, this is certainly the funkiest piece of the task isolation
>> stuff.  The problem is that the dyntick stuff may, for example, need
>> one more tick 4us from now (or whatever) just to close out the current
>> RCU period.  We can't return to userspace until that happens.  So what
>> else can we do when the task is ready to return to userspace?  We
>> could punt into the idle task instead of waiting in this task, which
>> was my earlier schedule_timeout() suggestion.  Do you think that's cleaner?
>>
> Unless I'm missing something (which is reasonably likely), couldn't
> the isolation code just force or require rcu_nocbs on the isolated
> CPUs to avoid this problem entirely?
>
> I admit I still don't understand why the RCU context tracking code
> can't just run the callback right away instead of waiting however many
> microseconds in general.  I feel like paulmck has explained it to me
> at least once, but that doesn't mean I remember the answer.

I admit I am not clear on this either.  However, since there are a
bunch of reasons why the dyntick might run (not just LRU), I think
fixing LRU may well not be enough to guarantee the dyntick
turns off exactly when we'd like it to.

And, with the structure proposed here, we can always come back
and revisit this by just removing the code that does the completion
waiting and replacing it with a call that simply tells the dyntick to
stop immediately, once we're confident we can make that work.

Then separately, we can also think about removing the code that
re-checks dyntick being stopped as we are about to return to
userspace with interrupts disabled, if we're convinced there's
also no way for the dyntick to get restarted due to an interrupt
being handled after we think the dyntick has been stopped.
I'd argue always leaving a WARN_ON() there would be good, though.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-09 17:40                         ` Chris Metcalf
@ 2016-09-12 17:41                           ` Andy Lutomirski
  2016-09-12 19:25                             ` Chris Metcalf
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2016-09-12 17:41 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>
> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>
>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>
>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>
>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>
>>>>> On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> What if we did it the other way around: set a percpu flag saying
>>>>>> "going quiescent; disallow new deferred work", then finish all
>>>>>> existing work and return to userspace.  Then, on the next entry, clear
>>>>>> that flag.  With the flag set, vmstat would just flush anything that
>>>>>> it accumulates immediately, nothing would be added to the LRU list,
>>>>>> etc.
>>>>>
>>>>>
>>>>> This is an interesting idea!
>>>>>
>>>>> However, there are a number of implementation issues that make me
>>>>> worry that it might be a trickier approach overall.
>>>>>
>>>>> First, "on the next entry" hides a world of hurt in four simple words.
>>>>> Some platforms (arm64 and tile, that I'm familiar with) have a common
>>>>> chunk of code that always runs on every entry to the kernel.  It would
>>>>> not be too hard to poke at the assembly and make those platforms
>>>>> always run some task-isolation specific code on entry.  But x86 scares
>>>>> me - there seem to be a whole lot of ways to get into the kernel, and
>>>>> I'm not convinced there is a lot of shared macrology or whatever that
>>>>> would make it straightforward to intercept all of them.
>>>>
>>>> Just use the context tracking entry hook.  It's 100% reliable.  The
>>>> relevant x86 function is enter_from_user_mode(), but I would just hook
>>>> into user_exit() in the common code.  (This code had better be
>>>> reliable, because context tracking depends on it, and, if context
>>>> tracking doesn't work on a given arch, then isolation isn't going to
>>>> work regardless.)
>>>
>>>
>>> This looks a lot cleaner than last time I looked at the x86 code. So yes, I think
>>> we could do an entry-point approach plausibly now.
>>>
>>> This is also good for when we want to look at deferring the kernel TLB flush,
>>> since it's the same mechanism that would be required for that.
>>>
>>>
>> There's at least one gotcha for the latter: NMIs aren't currently
>> guaranteed to go through context tracking.  Instead they use their own
>> RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
>> more care will be needed.  I would probably approach it with an
>> additional NMI hook in the same places as rcu_nmi_enter() that does,
>> more or less:
>>
>> if (need_tlb_flush) flush();
>>
>> and then make sure that the normal exit hook looks like:
>>
>> if (need_tlb_flush) {
>>    flush();
>>    barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't
>> been flushed */
>>    need_tlb_flush = false;
>> }
>
>
> This is a good point.  For now I will continue not trying to include the TLB flush
> in the current patch series, so I will sit on this until we're ready to do so.
>
>
>>>>> So to pop up a level, what is your actual concern about the existing
>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>> in the loop are generally familiar to people reading that code.
>>>>
>>>> My concern is that it's not obvious to readers of the code that the
>>>> loop ever terminates.  It really ought to, but it's doing something
>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>> actually blocking in the return-to-userspace path, especially blocking
>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>
>>>
>>> True, although, comments :-)
>>>
>>> Regardless, though, this doesn't seem at all weird to me in the
>>> context of the vmstat and lru stuff.  It's exactly parallel to
>>> the fact that we loop around on checking need_resched and signal, and
>>> in some cases you could imagine multiple loops around when we schedule
>>> out and get a signal, so loop around again, and then another
>>> reschedule event happens during signal processing so we go around
>>> again, etc.  Eventually it settles down.  It's the same with the
>>> vmstat/lru stuff.
>>
>> Only kind of.
>>
>> When we say, effectively, while (need_resched()) schedule();, we're
>> not waiting for an event or condition per se.  We're runnable (in the
>> sense that userspace wants to run and we're not blocked on anything)
>> the entire time -- we're simply yielding to some other thread that is
>> also runnable.  So if that loop runs forever, it either means that
>> we're at low priority and we genuinely shouldn't be running or that
>> there's a scheduler bug.
>>
>> If, on the other hand, we say while (not quiesced) schedule(); (or
>> equivalent), we're saying that we're *not* really ready to run and
>> that we're waiting for some condition to change.  The condition in
>> question is fairly complicated and won't wake us when we are ready.  I
>> can also imagine the scheduler getting rather confused, since, as far
>> as the scheduler knows, we are runnable and we are supposed to be
>> running.
>
>
> So, how about a code structure like this?
>
> In the main return-to-userspace loop where we check TIF flags,
> we keep the notion of our TIF_TASK_ISOLATION flag that causes
> us to invoke a task_isolation_prepare() routine.  This routine
> does the following things:
>
> 1. As you suggested, set a new TIF bit (or equivalent) that says the
> system should no longer create deferred work on this core, and then
> flush any necessary already-deferred work (currently, the LRU cache
> and the vmstat stuff).  We never have to go flush the deferred work
> again during this task's return to userspace.  Note that this bit can
> only be set on a core marked for task isolation, so it can't be used
> for denial of service type attacks on normal cores that are trying to
> multitask normal Linux processes.

I think it can't be a TIF flag unless you can do the whole mess with
preemption off because, if you get preempted, other tasks on the CPU
won't see the flag.  You could do it with a percpu flag, I think.

>
> 2. Check if the dyntick is stopped, and if not, wait on a completion
> that will be set when it does stop.  This means we may schedule out at
> this point, but when we return, the deferred work stuff is still safe
> since your bit is still set, and in principle the dyntick is
> stopped.
>
> Then, after we disable interrupts and re-read the thread-info flags,
> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
> set that would keep us in the loop.  This will always end up happening
> on each return to userspace, since the only thing that actually clears
> the bit is a prctl() call.  When that happens we know we are about to
> return to userspace, so we call task_isolation_ready(), which now has
> two things to do:

Would it perhaps be more straightforward to do the stuff before the
loop and not check TIF_TASK_ISOLATION in the loop?

>
> 1. We check that the dyntick is in fact stopped, since it's possible
> that a race condition led to it being somehow restarted by an interrupt.
> If it is not stopped, we go around the loop again so we can go back in
> to the completion discussed above and wait some more.  This may merit
> a WARN_ON or other notice since it seems like people aren't convinced
> there are things that could restart it, but honestly the dyntick stuff
> is complex enough that I think a belt-and-suspenders kind of test here
> at the last minute is just defensive programming.

Seems reasonable.  But maybe this could go after the loop and, if the
dyntick is back, it could be treated like any other kernel bug that
interrupts an isolated task?  That would preserve more of the existing
code structure.

If that works, it could go in user_enter().

>
> 2. Assuming it's stopped, we clear your bit at this point, and
> return "true" so the loop code knows to break out of the loop and do
> the actual return to userspace.  Clearing the bit at this point is
> better than waiting until we re-enter the kernel later, since it
> avoids having to figure out all the ways we actually can re-enter.
> With interrupts disabled, and this late in the return to userspace
> process, there's no way additional deferred work can be created.
>
>
>>>>>> Also, this cond_resched stuff doesn't worry me too much at a
>>>>>> fundamental level -- if we're really going quiescent, shouldn't we be
>>>>>> able to arrange that there are no other schedulable tasks on the CPU
>>>>>> in question?
>>>>>
>>>>> We aren't currently planning to enforce things in the scheduler, so if
>>>>> the application affinitizes another task on top of an existing task
>>>>> isolation task, by default the task isolation task just dies. (Unless
>>>>> it's using NOSIG mode, in which case it just ends up stuck in the
>>>>> kernel trying to wait out the dyntick until you either kill it, or
>>>>> re-affinitize the offending task.)  But I'm reluctant to guarantee
>>>>> every possible way that you might (perhaps briefly) have some
>>>>> schedulable task, and the current approach seems pretty robust if that
>>>>> sort of thing happens.
>>>>
>>>> This kind of waiting out the dyntick scares me.  Why is there ever a
>>>> dyntick that you're waiting out?  If quiescence is to be a supported
>>>> mainline feature, shouldn't the scheduler be integrated well enough
>>>> with it that you don't need to wait like this?
>>>
>>>
>>> Well, this is certainly the funkiest piece of the task isolation
>>> stuff.  The problem is that the dyntick stuff may, for example, need
>>> one more tick 4us from now (or whatever) just to close out the current
>>> RCU period.  We can't return to userspace until that happens.  So what
>>> else can we do when the task is ready to return to userspace?  We
>>> could punt into the idle task instead of waiting in this task, which
>>> was my earlier schedule_time() suggestion.  Do you think that's cleaner?
>>>
>> Unless I'm missing something (which is reasonably likely), couldn't
>> the isolation code just force or require rcu_nocbs on the isolated
>> CPUs to avoid this problem entirely?
>>
>> I admit I still don't understand why the RCU context tracking code
>> can't just run the callback right away instead of waiting however many
>> microseconds in general.  I feel like paulmck has explained it to me
>> at least once, but that doesn't mean I remember the answer.
>
>
> I admit I am not clear on this either.  However, since there are a
> bunch of reasons why the dyntick might run (not just LRU), I think
> fixing LRU may well not be enough to guarantee the dyntick
> turns off exactly when we'd like it to.
>
> And, with the structure proposed here, we can always come back
> and revisit this by just removing the code that does the completion
> waiting and replacing it with a call that just tells the dyntick to
> just stop immediately, once we're confident we can make that work.
>
> Then separately, we can also think about removing the code that
> re-checks dyntick being stopped as we are about to return to
> userspace with interrupts disabled, if we're convinced there's
> also no way for the dyntick to get restarted due to an interrupt
> being handled after we think the dyntick has been stopped.
> I'd argue always leaving a WARN_ON() there would be good, though.
>
>
> --
> Chris Metcalf, Mellanox Technologies
> http://www.mellanox.com
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-12 17:41                           ` Andy Lutomirski
@ 2016-09-12 19:25                             ` Chris Metcalf
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-09-12 19:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-doc, Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@mellanox.com> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates.  It really ought to, but it's doing something
>>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>>>
>>>> True, although, comments :-)
>>>>
>>>> Regardless, though, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff.  It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine multiple loops around when we schedule
>>>> out and get a signal, so loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc.  Eventually it settles down.  It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.
>>>
>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se.  We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable.  So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change.  The condition in
>>> question is fairly complicated and won't wake us when we are ready.  I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
>>
>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine.  This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff).  We never have to go flush the deferred work
>> again during this task's return to userspace.  Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial of service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag.  You could do it with a percpu flag, I think.

Yes, a percpu flag - you're right.  I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
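
A minimal sketch of what that might look like (the names here are
hypothetical, not from the posted series):

	/* linux/isolation.h */
	#ifdef CONFIG_TASK_ISOLATION
	DECLARE_PER_CPU(bool, task_isolation_active);
	/* True if this CPU must not accumulate new deferred work. */
	static inline bool task_isolation_this_cpu(void)
	{
		return this_cpu_read(task_isolation_active);
	}
	#else
	static inline bool task_isolation_this_cpu(void) { return false; }
	#endif

vmstat, the LRU code, etc. would then test task_isolation_this_cpu()
right at the point where they would otherwise defer work.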

>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop.  This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyntick is
>> stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop.  This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call.  When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?

We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task.  We'd
clearly like that other task to run before the isolated task returns to
userspace.  But then, that other task might do something to re-enable
the dyntick.  That's why we'd like to recheck that dyntick is off in
the loop after each potential call to schedule().
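
In rough pseudocode (using the helper names from the proposal above):

	do {
		local_irq_enable();
		/* handle resched, signals, and other TIF work ... */
		if (test_thread_flag(TIF_TASK_ISOLATION))
			task_isolation_prepare();	/* may schedule() */
		local_irq_disable();
	} while (!task_isolation_ready());	/* e.g. dyntick restarted */
	/* return to userspace with interrupts still disabled */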

>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back in
>> to the completion discussed above and wait some more.  This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable.  But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task?  That would preserve more of the existing
> code structure.

Well, we can certainly try it that way.  If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK.  But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.

> If that works, it could go in user_enter().

Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?
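
(For reference, the call site in question, lightly simplified from
kernel/context_tracking.c of this era:

	void __context_tracking_enter(enum ctx_state state)
	{
		...
		if (state == CONTEXT_USER) {
			trace_user_enter(0);
			vtime_user_enter(current);
		}
		rcu_user_enter();
		...
	}

so a last-minute task-isolation check would slot in alongside those
hooks.)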

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-02 17:28                       ` Andy Lutomirski
  2016-09-09 17:40                         ` Chris Metcalf
@ 2016-09-27 14:22                         ` Frederic Weisbecker
  2016-09-27 14:39                           ` Peter Zijlstra
  2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 2 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chris Metcalf, Thomas Gleixner, linux-doc, Christoph Lameter,
	Michal Hocko, Gilad Ben Yossef, Andrew Morton, Viresh Kumar,
	Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo, Rik van Riel,
	Will Deacon, Paul E. McKenney, linux-mm, linux-kernel,
	Catalin Marinas, Peter Zijlstra

On Fri, Sep 02, 2016 at 10:28:00AM -0700, Andy Lutomirski wrote:
> 
> Unless I'm missing something (which is reasonably likely), couldn't
> the isolation code just force or require rcu_nocbs on the isolated
> CPUs to avoid this problem entirely?

rcu_nocb is already implied by nohz_full, which means that RCU callbacks
are offloaded to kthreads that can run outside the nohz_full set of CPUs.
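
(Concretely, booting an 8-core box with, say:

	nohz_full=4-7

implies rcu_nocbs=4-7 as well, so the callbacks for CPUs 4-7 are
invoked from rcuo kthreads that the scheduler can keep on the
housekeeping CPUs 0-3.)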

> 
> I admit I still don't understand why the RCU context tracking code
> can't just run the callback right away instead of waiting however many
> microseconds in general.  I feel like paulmck has explained it to me
> at least once, but that doesn't mean I remember the answer.

The RCU context tracking doesn't take care of callbacks. It's only there
to tell the RCU core whether the CPU runs code that may or may not run
RCU read-side critical sections. The working assumption is "kernel may
use RCU, userspace can't".


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:22                         ` Frederic Weisbecker
@ 2016-09-27 14:39                           ` Peter Zijlstra
  2016-09-27 14:51                             ` Frederic Weisbecker
  2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2016-09-27 14:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas

On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:

> The RCU context tracking doesn't take care of callbacks. It's only there
> to tell the RCU core whether the CPU runs code that may or may not run
> RCU read-side critical sections. The working assumption is "kernel may
> use RCU, userspace can't".

Userspace can never use the kernel's RCU in any case. What you mean to
say is that userspace is treated like an idle CPU, in that the CPU will
no longer be part of the RCU quiescent state machine.

The transition to userspace (as per context tracking) must ensure that
the CPU's RCU state is 'complete', just like our transition to idle (mostly)
does.
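
(Both transitions funnel into the same extended-quiescent-state
machinery; simplified from kernel/rcu/tree.c of this era:

	void rcu_idle_enter(void) { ... rcu_eqs_enter(false); ... }
	void rcu_user_enter(void) { rcu_eqs_enter(1); }

i.e. entering userspace under context tracking marks the CPU
quiescent for RCU just as entering idle does.)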


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:22                         ` Frederic Weisbecker
  2016-09-27 14:39                           ` Peter Zijlstra
@ 2016-09-27 14:48                           ` Paul E. McKenney
  1 sibling, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2016-09-27 14:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, linux-mm, linux-kernel,
	Catalin Marinas, Peter Zijlstra

On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:
> On Fri, Sep 02, 2016 at 10:28:00AM -0700, Andy Lutomirski wrote:
> > 
> > Unless I'm missing something (which is reasonably likely), couldn't
> > the isolation code just force or require rcu_nocbs on the isolated
> > CPUs to avoid this problem entirely?
> 
> rcu_nocb is already implied by nohz_full, which means that RCU callbacks
> are offloaded to kthreads that can run outside the nohz_full set of CPUs.

Indeed, at boot time, RCU makes any nohz_full CPU also be a rcu_nocb
CPU.

> > I admit I still don't understand why the RCU context tracking code
> > can't just run the callback right away instead of waiting however many
> > microseconds in general.  I feel like paulmck has explained it to me
> > at least once, but that doesn't mean I remember the answer.
> 
> The RCU context tracking doesn't take care of callbacks. It's only there
> to tell the RCU core whether the CPU runs code that may or may not run
> RCU read-side critical sections. The working assumption is "kernel may
> use RCU, userspace can't".

And RCU has to wait for read-side critical sections to complete before
invoking callbacks.
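
(The contract in miniature:

	rcu_read_lock();
	p = rcu_dereference(gp);
	/* ... use p ... */
	rcu_read_unlock();

a call_rcu() callback that frees p cannot be invoked until every
read-side section that might still see p has completed.)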

							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-09-27 14:39                           ` Peter Zijlstra
@ 2016-09-27 14:51                             ` Frederic Weisbecker
  0 siblings, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2016-09-27 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Chris Metcalf, Thomas Gleixner, linux-doc,
	Christoph Lameter, Michal Hocko, Gilad Ben Yossef, Andrew Morton,
	Viresh Kumar, Linux API, Steven Rostedt, Ingo Molnar, Tejun Heo,
	Rik van Riel, Will Deacon, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas

On Tue, Sep 27, 2016 at 04:39:26PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:
> 
> > The RCU context tracking doesn't take care of callbacks. It's only there
> > to tell the RCU core whether the CPU runs code that may or may not run
> > RCU read side critical sections. This is assumed by "kernel may use RCU,
> > userspace can't".
> 
> Userspace can never use the kernel's RCU in any case. What you mean to
> say is that userspace is treated like an idle CPU, in that the CPU will
> no longer be part of the RCU quiescent state machine.
> 
> The transition to userspace (as per context tracking) must ensure that
> the CPU's RCU state is 'complete', just like our transition to idle (mostly)
> does.

Exactly!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v15 04/13] task_isolation: add initial support
  2016-08-30 18:43               ` Andy Lutomirski
  2016-08-30 19:37                 ` Chris Metcalf
@ 2016-09-30 16:59                 ` Chris Metcalf
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Metcalf @ 2016-09-30 16:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Christoph Lameter, Michal Hocko,
	Gilad Ben Yossef, Andrew Morton, Linux API, Viresh Kumar,
	Ingo Molnar, Steven Rostedt, Tejun Heo, Will Deacon,
	Rik van Riel, Frederic Weisbecker, Paul E. McKenney, linux-mm,
	linux-kernel, Catalin Marinas, Peter Zijlstra

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@mellanox.com> wrote:
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc).  But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur.  I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace.  Then, on the next entry, clear
> that flag.  With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.

Thinking about this some more, I was struck by an even simpler way
to approach this.  What if we just said that on task isolation cores, no
kernel subsystem should do something that would require a future
interruption?  So vmstat would just always sync immediately on task
isolation cores, the mm subsystem wouldn't use per-cpu LRU stuff on
task isolation cores, etc.  That way we don't have to worry about the
status of those things as we are returning to userspace for a task
isolation process, since it's just always kept "pristine".
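
As a sketch of what that could mean for vmstat (illustrative only;
task_isolation_core() is a hypothetical predicate for "this is a
task-isolation core"):

	void __mod_zone_page_state(struct zone *zone,
				   enum zone_stat_item item, long delta)
	{
		if (task_isolation_core(smp_processor_id())) {
			/* Fold the delta into the global counter now,
			 * leaving no per-cpu differential behind and
			 * hence no future vmstat_update() work. */
			zone_page_state_add(delta, zone, item);
			return;
		}
		/* ... existing per-cpu differential path ... */
	}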

The task-isolation setting per-core is not user-customizable, and the
task-stealing scheduler doesn't even run there, so it's not like any
processes will land there and be in a position to complain about the
performance overhead of having no deferred work being created...

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-09-30 17:00 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com>
2016-08-16 21:19 ` [PATCH v15 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-08-16 21:19 ` [PATCH v15 04/13] task_isolation: add initial support Chris Metcalf
2016-08-29 16:33   ` Peter Zijlstra
2016-08-29 16:40     ` Chris Metcalf
2016-08-29 16:48       ` Peter Zijlstra
2016-08-29 16:53         ` Chris Metcalf
2016-08-30  7:59           ` Peter Zijlstra
2016-08-30  7:58       ` Peter Zijlstra
2016-08-30 15:32         ` Chris Metcalf
2016-08-30 16:30           ` Andy Lutomirski
2016-08-30 17:02             ` Chris Metcalf
2016-08-30 18:43               ` Andy Lutomirski
2016-08-30 19:37                 ` Chris Metcalf
2016-08-30 19:50                   ` Andy Lutomirski
2016-09-02 14:04                     ` Chris Metcalf
2016-09-02 17:28                       ` Andy Lutomirski
2016-09-09 17:40                         ` Chris Metcalf
2016-09-12 17:41                           ` Andy Lutomirski
2016-09-12 19:25                             ` Chris Metcalf
2016-09-27 14:22                         ` Frederic Weisbecker
2016-09-27 14:39                           ` Peter Zijlstra
2016-09-27 14:51                             ` Frederic Weisbecker
2016-09-27 14:48                           ` Paul E. McKenney
2016-09-30 16:59                 ` Chris Metcalf
2016-09-01 10:06           ` Peter Zijlstra
2016-09-02 14:03             ` Chris Metcalf
2016-09-02 16:40               ` Peter Zijlstra
