* [RFC] [PATCH] Pre-emption control for userspace
@ 2014-03-03 18:07 Khalid Aziz
  2014-03-03 21:51 ` Davidlohr Bueso
                   ` (4 more replies)
  0 siblings, 5 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-03 18:07 UTC (permalink / raw)
  To: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg, venki
  Cc: Khalid Aziz, linux-kernel


I am working on a feature, requested by database folks, that helps with
performance. Some of the oft-executed database code uses mutexes to lock
other threads out of a critical section. They often see a situation
where a thread grabs the mutex, runs out of its timeslice and gets
switched out, which then causes another thread to run which tries to
grab the same mutex, spins for a while and finally gives up. This can
happen with multiple threads until the original lock owner gets the CPU
again and can complete executing its critical section. This queueing and
the subsequent waste of CPU cycles can be avoided if the locking thread
could request to be granted an additional timeslice when its current
timeslice runs out before it gives up the lock. Other operating systems
have implemented this functionality and it is used by databases as well
as the JVM. This functionality has been shown to improve performance by
3%-5%.

I have implemented similar functionality for Linux. This patch adds a
file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
Writing 1 to this file causes the CFS scheduler to grant an additional
timeslice if the currently running thread comes up for pre-emption.
Writing to this file needs to be a very quick operation, so I have
implemented code to allow mmap'ing
/proc/<tgid>/task/<tid>/sched_preempt_delay. This allows a userspace
task to update this flag very quickly. The usage model is: a thread
mmaps this file during initialization. It then writes a 1 to the mmap'd
file after it grabs the lock in the critical section where it wants
immunity from pre-emption. It writes 0 to this file again after it
releases the lock and calls sched_yield() to give up the processor. I
have also added a new field in scheduler statistics - nr_preempt_delayed
- that counts the number of times a thread has been granted amnesty.
Further details on using this functionality are in
Documentation/scheduler/sched-preempt-delay.txt in the patch. This
patch is based upon work Venkatesh Pallipadi had done a couple of
years ago.
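
For reference, here is a condensed, untested sketch of that usage model
(essentially the sample from the documentation file in this patch, with
error checking omitted):

#include <stdio.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

int main(void)
{
	char fn[256];
	unsigned long *map;
	long psz = sysconf(_SC_PAGE_SIZE);
	int fd;

	snprintf(fn, sizeof(fn), "/proc/%lu/task/%lu/sched_preempt_delay",
		 (unsigned long)getpid(), (unsigned long)syscall(SYS_gettid));
	fd = open(fn, O_RDWR);
	map = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* ... grab the lock ... */
	map[0] = 1;		/* request immunity from pre-emption */
	/* ... critical section ... */
	/* ... release the lock ... */
	map[0] = 0;		/* drop the request */
	sched_yield();		/* give the processor back */

	munmap(map, psz);
	close(fd);
	return 0;
}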

Please provide feedback on this functionality and patch.

-- Khalid

--------------------------------

This patch adds a way for a thread to request an additional timeslice
from the scheduler if it is about to be preempted, so it can complete
any critical task it is in the middle of. This functionality helps with
performance on databases and has been used by databases for many years
on other OSs. It helps in the situation where a thread acquires a lock
before performing a critical operation on the database and happens to
get preempted before it completes its task and releases the lock. This
causes all other threads that need the same lock for their own critical
operations on the database to start queueing up, causing a large number
of context switches. This queueing problem can be avoided if the thread
that acquires the lock first could request the scheduler to grant it an
additional timeslice once it enters its critical section, allowing it
to complete the critical section without causing the queueing problem.
If the critical section completes before the thread is due for
preemption, the thread can simply deassert its request. A thread sends
the scheduler this request through a proc file which it can mmap and
write to with very little overhead. The documentation file included in
this patch contains further details on how to use this functionality
and the conditions associated with its use. This patch also adds a new
field to scheduler statistics which keeps track of how many times a
thread was granted amnesty from preemption.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 Documentation/scheduler/sched-preempt-delay.txt |  85 ++++++++++++
 arch/x86/Kconfig                                |  10 ++
 fs/proc/base.c                                  | 173 ++++++++++++++++++++++++
 include/linux/preempt_delay.h                   |  37 +++++
 include/linux/sched.h                           |  12 ++
 kernel/exit.c                                   |   9 ++
 kernel/fork.c                                   |   7 +
 kernel/sched/Makefile                           |   1 +
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             |  59 +++++++-
 kernel/sched/preempt_delay.c                    |  39 ++++++
 11 files changed, 430 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt
 create mode 100644 include/linux/preempt_delay.h
 create mode 100644 kernel/sched/preempt_delay.c

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..aa7e611
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,85 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, the scheduler ends up scheduling
+other threads which immediately try to grab the lock to enter the
+critical section. This only results in lots of context switches as tasks
+wake up and go to sleep immediately again. If, on the other hand, the
+original task were allowed to run for an extra timeslice, it could have
+completed its critical section, allowing other tasks to make progress
+when they get scheduled. The preemption delay feature allows a task to
+request that the scheduler grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is enabled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in the kernel configuration. Once this
+feature is enabled, the kernel creates a file
+/proc/<tgid>/task/<tid>/sched_preempt_delay which can be mmap'd by the
+task. This file contains a single long value which is a flag indicating
+whether the task is requesting a preemption delay or not. A task
+requests a preemption delay by writing a non-zero value to this file.
+The scheduler checks this value before preempting the task and can
+choose to grant one, and only one, additional timeslice to the task for
+each delay request. The scheduler clears this flag when it makes the
+decision to grant or not grant the delay request. The following sample
+code illustrates the use:
+
+int main()
+{
+	int fd, fsz;
+	char fn[256];
+	unsigned long *map;
+
+	sprintf(fn, "/proc/%lu/task/%lu/sched_preempt_delay", (unsigned long)getpid(), (unsigned long)syscall(SYS_gettid));
+	fd = open(fn, O_RDWR);
+	fsz = sysconf(_SC_PAGE_SIZE);
+	map = mmap(NULL, fsz, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		map[0] = 1;
+		/*
+		 * critical section
+		 */
+		map[0] = 0;
+		/* Give the CPU up */
+		sched_yield();
+		/* do some more work */
+	}
+	munmap(map, fsz);
+	close(fd);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+The preemption delay feature adds a new field to scheduler statistics -
+nr_preempt_delayed. This is a per-thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with the other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. /proc/<tgid>/task/<tid>/sched_preempt_delay can be mmap'd by only
+   one task at a time.
+
+2. Once mmap'd, /proc/<tgid>/task/<tid>/sched_preempt_delay can be written
+   to only by the task that mmap'd it. Reading
+   /proc/<tgid>/task/<tid>/sched_preempt_delay will continue to return the
+   current value.
+
+3. Upon mmap, /proc/<tgid>/task/<tid>/sched_preempt_delay is set to 0.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..ee10019 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -849,6 +849,16 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	depends on PROC_FS && PREEMPT_NOTIFIERS
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by writing to
+	  /proc/<tgid>/task/<tid>/sched_preempt_delay.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5150706..902d07f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1304,6 +1304,176 @@ static const struct file_operations proc_pid_sched_operations = {
 
 #endif
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static int
+tid_preempt_delay_show(struct seq_file *m, void *v)
+{
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ENOENT;
+
+	sched_preempt_delay_show(m, task);
+	put_task_struct(task);
+	return 0;
+}
+
+static ssize_t
+tid_preempt_delay_write(struct file *file, const char __user *buf,
+			  size_t count, loff_t *offset)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ENOENT;
+
+	/*
+	 * Do not allow write if proc file is currently mmap'd
+	 */
+	if (task->sched_preempt_delay.mmap_state)
+		return -EPERM;
+
+	sched_preempt_delay_set(task, buf[0]);
+	put_task_struct(task);
+	return count;
+}
+
+static int
+tid_preempt_delay_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, tid_preempt_delay_show, inode);
+}
+
+static int
+fault_preempt_delay_vmops(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct preemp_delay_mmap_state *state;
+
+	state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
+	if (!state)
+		return VM_FAULT_SIGBUS;
+
+	if (vmf->flags & FAULT_FLAG_MKWRITE) {
+		SetPageUptodate(state->page);
+		return 0;
+	}
+
+	get_page(state->page);
+	vmf->page = state->page;
+	vmf->page->mapping = vma->vm_file->f_mapping;
+	vmf->page->index = vmf->pgoff;
+
+	return 0;
+}
+
+static void
+close_preempt_delay_vmops(struct vm_area_struct *vma)
+{
+	struct preemp_delay_mmap_state *state;
+
+	state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
+	BUG_ON(!state || !state->task);
+
+	state->page->mapping = NULL;
+	/* point delay request flag pointer back to old flag in task_struct */
+	state->task->sched_preempt_delay.delay_req =
+			&state->task->sched_preempt_delay.delay_flag;
+	state->task->sched_preempt_delay.mmap_state = NULL;
+	vfree(state->kaddr);
+	kfree(state);
+	vma->vm_private_data = NULL;
+}
+
+static const struct vm_operations_struct preempt_delay_vmops = {
+	.fault		= fault_preempt_delay_vmops,
+	.page_mkwrite	= fault_preempt_delay_vmops,
+	.close		= close_preempt_delay_vmops,
+};
+
+
+static int
+tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int retval = 0;
+	void *kaddr = NULL;
+	struct preemp_delay_mmap_state *state = NULL;
+	struct inode *inode = file_inode(file);
+	struct task_struct *task;
+	struct page *page;
+
+	/*
+	 * Validate args:
+	 * - Only offset 0 support for now
+	 * - size should be PAGE_SIZE
+	 */
+	if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
+		retval = -EINVAL;
+		goto error;
+	}
+
+	/*
+	 * Only one mmap allowed at a time
+	 */
+	if (current->sched_preempt_delay.mmap_state != NULL) {
+		retval = -EEXIST;
+		goto error;
+	}
+
+	state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
+	kaddr = vmalloc_user(PAGE_SIZE);
+	if (!state || !kaddr) {
+		retval = -ENOMEM;
+		goto error;
+	}
+
+	page = vmalloc_to_page(kaddr);
+	if (!page) {
+		retval = -ENOMEM;
+		goto error;
+	}
+	/*
+	 * This mmap belongs to the thread that owns the preemption
+	 * delay request flag, not to other threads that may belong to
+	 * the same process.
+	 */
+	task = get_proc_task(inode);
+	state->page = page;
+	state->kaddr = kaddr;
+	state->uaddr = (void *)vma->vm_start;
+	state->task = task;
+
+	/* Clear the current delay request flag */
+	task->sched_preempt_delay.delay_flag = 0;
+
+	/* Point delay request flag pointer to the newly allocated memory */
+	task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
+
+	task->sched_preempt_delay.mmap_state = state;
+	vma->vm_private_data = state;
+	vma->vm_ops = &preempt_delay_vmops;
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
+
+	return 0;
+
+error:
+	kfree(state);
+	if (kaddr)
+		vfree(kaddr);
+	return retval;
+}
+
+static const struct file_operations proc_tid_preempt_delay_ops = {
+	.open		= tid_preempt_delay_open,
+	.read		= seq_read,
+	.write		= tid_preempt_delay_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.mmap		= tid_preempt_delay_mmap,
+};
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 /*
  * Print out autogroup related information:
@@ -2998,6 +3168,9 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	REG("sched_preempt_delay", S_IRUGO|S_IWUSR, proc_tid_preempt_delay_ops),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/preempt_delay.h b/include/linux/preempt_delay.h
new file mode 100644
index 0000000..70caa81
--- /dev/null
+++ b/include/linux/preempt_delay.h
@@ -0,0 +1,37 @@
+#ifndef __PREEMPT_DELAY_H__
+#define __PREEMPT_DELAY_H__
+
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * struct preempt_delay is part of the task struct. It keeps the status
+ * of the current request flag, whether the request has been granted,
+ * and a pointer to the structure that holds data needed for mmap. This
+ * structure has a variable to hold the delay request flag and a pointer
+ * to the delay request flag. When a new task is initialized, delay_req
+ * points to the delay_flag element in this structure. This pointer is
+ * changed when mmap happens on /proc/<pid>/task/<tid>/sched_preempt_delay.
+ * Since procfs works on the granularity of a page size, it makes no sense
+ * to allocate a whole page for each task to hold just a flag. So
+ * initially delay_req points to a single unsigned-char-sized variable.
+ * When mmap happens, a page is allocated and this pointer is redirected
+ * to the newly allocated page. Races are not much of an issue since
+ * delay_req always points to a valid memory address. The mmap operation
+ * always causes the task to start with a request flag value of 0, so the
+ * old value of the flag is irrelevant. munmap points delay_req back to
+ * delay_flag.
+ */
+struct preempt_delay {
+	unsigned char *delay_req;	/* delay request flag pointer */
+	unsigned char delay_flag;	/* delay request flag */
+	unsigned char delay_granted;	/* currently in delay */
+	struct preemp_delay_mmap_state *mmap_state;
+};
+
+struct preemp_delay_mmap_state {
+	void *kaddr;			/* kernel vmalloc access addr */
+	void *uaddr;			/* user mapped address */
+	struct page *page;
+	struct task_struct *task;
+};
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+#endif /* __PREEMPT_DELAY_H__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..eeaf7f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -54,6 +54,7 @@ struct sched_param {
 #include <linux/llist.h>
 #include <linux/uidgid.h>
 #include <linux/gfp.h>
+#include <linux/preempt_delay.h>
 
 #include <asm/processor.h>
 
@@ -1056,6 +1057,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1250,6 +1252,9 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay sched_preempt_delay;
+#endif
 
 	pid_t pid;
 	pid_t tgid;
@@ -2061,6 +2066,13 @@ extern u64 scheduler_tick_max_deferment(void);
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+					struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+					unsigned char val);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/kernel/exit.c b/kernel/exit.c
index 1e77fc6..b7e4dbf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -53,6 +53,7 @@
 #include <linux/oom.h>
 #include <linux/writeback.h>
 #include <linux/shm.h>
+#include <linux/preempt_delay.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -721,6 +722,14 @@ void do_exit(long code)
 
 	validate_creds_for_do_exit(tsk);
 
+#if CONFIG_SCHED_PREEMPT_DELAY
+	if (tsk->sched_preempt_delay.mmap_state) {
+		sys_munmap((unsigned long)
+			tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
+		vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
+		kfree(tsk->sched_preempt_delay.mmap_state);
+	}
+#endif
 	/*
 	 * We're taking recursive faults here in do_exit. Safest is to just
 	 * leave this task alone and wait for reboot.
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..94b65c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -71,6 +71,7 @@
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
 #include <linux/aio.h>
+#include <linux/preempt_delay.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1617,6 +1618,12 @@ long do_fork(unsigned long clone_flags,
 			init_completion(&vfork);
 			get_task_struct(p);
 		}
+#if CONFIG_SCHED_PREEMPT_DELAY
+		p->sched_preempt_delay.delay_req =
+				&p->sched_preempt_delay.delay_flag;
+		p->sched_preempt_delay.delay_flag = 0;
+		p->sched_preempt_delay.delay_granted = 0;
+#endif
 
 		wake_up_new_task(p);
 
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 9a95c8c..b2582fe 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_PREEMPT_DELAY) += preempt_delay.o
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..2abd02b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -602,6 +602,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7815709..4250733 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,7 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/preempt_delay.h>
 
 #include <trace/events/sched.h>
 
@@ -444,6 +445,58 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_task(): Check if the task about to be preempted has
+ *	requested an additional time slice. If it has, grant it additional
+ *	timeslice once.
+ */
+static void
+delay_resched_task(struct task_struct *curr)
+{
+	struct sched_entity *se;
+	int cpu = task_cpu(curr);
+	unsigned char *delay_req;
+
+	if (cpu != smp_processor_id())
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will be granted only once. If this task
+	 * has already been granted a delay, reschedule now
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		if (delay_req)
+			*delay_req = 0;
+		goto resched_now;
+	}
+
+	/* If the task is not requesting an additional timeslice, resched now */
+	if (delay_req && (*delay_req == 0))
+		goto resched_now;
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. Give it an extra timeslice.
+	 */
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	return;
+
+resched_now:
+	resched_task(curr);
+}
+#else
+static void
+delay_resched_task(struct task_struct *curr)
+{
+	resched_task(curr);
+}
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2679,7 +2732,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_task(rq_of(cfs_rq)->curr);
+		delay_resched_task(rq_of(cfs_rq)->curr);
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2703,7 +2756,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_task(rq_of(cfs_rq)->curr);
+		delay_resched_task(rq_of(cfs_rq)->curr);
 }
 
 static void
@@ -4477,7 +4530,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_task(curr);
+	delay_resched_task(curr);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sched/preempt_delay.c b/kernel/sched/preempt_delay.c
new file mode 100644
index 0000000..a8b1c1e
--- /dev/null
+++ b/kernel/sched/preempt_delay.c
@@ -0,0 +1,39 @@
+/*
+ * preempt_delay.c - Facility to allow tasks to request additional
+ *		     timeslice from the scheduler at scheduler's discretion
+ * Copyright (C) 2013 Khalid Aziz <khalid.aziz@oracle.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ *
+ * CONFIG_SCHED_PREEMPT_DELAY infrastructure creates a proc file
+ * /proc/<pid>/task/<tid>/sched_preempt_delay which allows a task to
+ * signal to the scheduler to grant it an extra timeslice once if the
+ * task is about to be pre-empted, by writing a 1 to the file. This
+ * file includes the code to support reading from and writing to this
+ * proc file.
+ */
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+#include "sched.h"
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/preempt_delay.h>
+
+void
+sched_preempt_delay_show(struct seq_file *m, struct task_struct *task)
+{
+	unsigned char *delay_req = task->sched_preempt_delay.delay_req;
+
+	if (delay_req)
+		seq_printf(m, "%d\n", *delay_req);
+}
+
+void
+sched_preempt_delay_set(struct task_struct *task, unsigned char val)
+{
+	unsigned char *delay_req = task->sched_preempt_delay.delay_req;
+
+	if (delay_req)
+		*delay_req = (val != '0'?1:0);
+}
+#endif
-- 
1.8.3.2



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
@ 2014-03-03 21:51 ` Davidlohr Bueso
  2014-03-03 23:29   ` Khalid Aziz
  2014-03-04 13:56 ` Oleg Nesterov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Davidlohr Bueso @ 2014-03-03 21:51 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	venki, linux-kernel

On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.

This strikes me as more of a feature for a real-time kernel. It is
definitely an interesting concept, but I wonder about it being abused.
Also, what about just using a voluntary preemption model instead? I'd
think that systems where this is really a problem would opt for that.

> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.

Could you elaborate more on those performance numbers? What
benchmark/workload?

Thanks,
Davidlohr



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-03 21:51 ` Davidlohr Bueso
@ 2014-03-03 23:29   ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-03 23:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	venki, linux-kernel

On 03/03/2014 02:51 PM, Davidlohr Bueso wrote:
> On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
>> I am working on a feature that has been requested by database folks that
>> helps with performance. Some of the oft executed database code uses
>> mutexes to lock other threads out of a critical section. They often see
>> a situation where a thread grabs the mutex, runs out of its timeslice
>> and gets switched out which then causes another thread to run which
>> tries to grab the same mutex, spins for a while and finally gives up.
>
> This strikes me more of a feature for a real-time kernel. It is
> definitely an interesting concept but wonder about it being abused.
> Also, what about just using a voluntary preemption model instead? I'd
> think that systems where this is really a problem would opt for that.

That was my first thought as well when I was asked to implement this
feature :) Designing a system as a real-time system indeed gives the
designer good control over pre-emption, but the database folks do not
really want or need a full real-time system, especially since they may
have to run on the same server as other database-related services. The
JVM certainly cannot expect to be run as a realtime process. Database
folks are perfectly happy running with the CFS scheduler all the time
except during this kind of critical section. This approach gives them
some control to get an extra timeslice when they need it. As for abuse,
it is no different from a realtime process, which can lock up a
processor much worse than this approach. As is the case when using
realtime schedulers, one must use the tools wisely. I have thought about
allowing sysadmins to lock this functionality down some, but that does
add more complexity. I am open to doing that if most people feel it is
necessary.

>
>> This can happen with multiple threads until original lock owner gets the
>> CPU again and can complete executing its critical section. This queueing
>> and subsequent CPU cycle wastage can be avoided if the locking thread
>> could request to be granted an additional timeslice if its current
>> timeslice runs out before it gives up the lock. Other operating systems
>> have implemented this functionality and is used by databases as well as
>> JVM. This functionality has been shown to improve performance by 3%-5%.
>
> Could you elaborate more on those performance numbers? What
> benchmark/workload?
>
> Thanks,
> Davidlohr
>

This was with TPC-C.

Thanks,
Khalid



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
  2014-03-03 21:51 ` Davidlohr Bueso
@ 2014-03-04 13:56 ` Oleg Nesterov
  2014-03-04 17:44   ` Khalid Aziz
  2014-03-04 21:12 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-04 13:56 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/03, Khalid Aziz wrote:
>
> This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock.

Well. I am in no position to discuss the changes in sched/fair.c. I have
to admit that I am skeptical about the whole idea/implementation, but I
leave this to sched/ maintainers.

However, at least the proc/mmap changes do not look right. I didn't read
the whole patch, just picked a "random" ->mmap function, see below.

>  kernel/sched/preempt_delay.c                    |  39 ++++++

Why? This can go into proc/ as well.

> +static void
> +close_preempt_delay_vmops(struct vm_area_struct *vma)
> +{
> +	struct preemp_delay_mmap_state *state;
> +
> +	state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
> +	BUG_ON(!state || !state->task);
> +
> +	state->page->mapping = NULL;
> +	/* point delay request flag pointer back to old flag in task_struct */
> +	state->task->sched_preempt_delay.delay_req =
> +			&state->task->sched_preempt_delay.delay_flag;
> +	state->task->sched_preempt_delay.mmap_state = NULL;
> +	vfree(state->kaddr);
> +	kfree(state);
> +	vma->vm_private_data = NULL;
> +}

Suppose that state->task != current. Then this can race with do_exit(),
which cleans up ->mmap_state too. OTOH do_exit() unmaps this region; it
is not clear why it can't rely on vm_ops->close().

Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
should be NULL ?? Perhaps I misread this patch completely...

> +static int
> +tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	int retval = 0;
> +	void *kaddr = NULL;
> +	struct preemp_delay_mmap_state *state = NULL;
> +	struct inode *inode = file_inode(file);
> +	struct task_struct *task;
> +	struct page *page;
> +
> +	/*
> +	 * Validate args:
> +	 * - Only offset 0 support for now
> +	 * - size should be PAGE_SIZE
> +	 */
> +	if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
> +		retval = -EINVAL;
> +		goto error;
> +	}
> +
> +	/*
> +	 * Only one mmap allowed at a time
> +	 */
> +	if (current->sched_preempt_delay.mmap_state != NULL) {
> +		retval = -EEXIST;
> +		goto error;

This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
but what if the task opens /proc/random_tid/sched_preempt_delay ?

> +	state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
> +	kaddr = vmalloc_user(PAGE_SIZE);

Why vmalloc() ? We only need a single page?

> +	task = get_proc_task(inode);

And it seems that nobody does put_task_struct(state->task);

> +	state->page = page;
> +	state->kaddr = kaddr;
> +	state->uaddr = (void *)vma->vm_start;

This is used by do_exit(). But ->vm_start can be changed by mremap() ?


Hmm. And mremap() can do vm_ops->close() too. But the new vma will
have the same vm_ops/vm_private_data, so exit_mmap() will try to do
this again... Perhaps I missed something, but I bet this all can't be
right.

> +	state->task = task;
> +
> +	/* Clear the current delay request flag */
> +	task->sched_preempt_delay.delay_flag = 0;
> +
> +	/* Point delay request flag pointer to the newly allocated memory */
> +	task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
> +
> +	task->sched_preempt_delay.mmap_state = state;
> +	vma->vm_private_data = state;
> +	vma->vm_ops = &preempt_delay_vmops;
> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;

This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
VM_SHARED/VM_WRITE doesn't look right.

Oleg.



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 13:56 ` Oleg Nesterov
@ 2014-03-04 17:44   ` Khalid Aziz
  2014-03-04 18:38     ` Al Viro
  2014-03-04 19:03     ` Oleg Nesterov
  0 siblings, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 17:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

Thanks for the review. Please see my comments inline below.

On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
> On 03/03, Khalid Aziz wrote:
>>   kernel/sched/preempt_delay.c                    |  39 ++++++
>
> Why? This can go into proc/ as well.
>

Sure. No strong reason to keep these functions in a separate file. These
functions can go into fs/proc/base.c.

>> +static void
>> +close_preempt_delay_vmops(struct vm_area_struct *vma)
>> +{
>> +	struct preemp_delay_mmap_state *state;
>> +
>> +	state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
>> +	BUG_ON(!state || !state->task);
>> +
>> +	state->page->mapping = NULL;
>> +	/* point delay request flag pointer back to old flag in task_struct */
>> +	state->task->sched_preempt_delay.delay_req =
>> +			&state->task->sched_preempt_delay.delay_flag;
>> +	state->task->sched_preempt_delay.mmap_state = NULL;
>> +	vfree(state->kaddr);
>> +	kfree(state);
>> +	vma->vm_private_data = NULL;
>> +}
>
> Suppose that state->task != current. Then this can race with do_exit()
> which cleanups ->mmap_state too. OTOH do_exit() unmaps this region, it
> is not clear why it can't rely in vm_ops->close().
>
> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
> should be NULL ?? Perhaps I misread this patch completely...

do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and 
mmap_state. mmap_state should not be NULL after unmap. vfree() and 
kfree() are tolerant of pointers that have already been freed. On the 
other hand mmap_state can be NULL in do_exit() if do_exit() and 
close_preempt_delay_vmops() were to race since 
close_preempt_delay_vmops() sets mmap_state to NULL just before it frees 
it up. Could they indeed race, because the thread happens to be killed 
just as it had called munmap()? I can protect against that with a 
refcount in mmap_state. Do you feel this is necessary/helpful to do?

>
>> +static int
>> +tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	int retval = 0;
>> +	void *kaddr = NULL;
>> +	struct preemp_delay_mmap_state *state = NULL;
>> +	struct inode *inode = file_inode(file);
>> +	struct task_struct *task;
>> +	struct page *page;
>> +
>> +	/*
>> +	 * Validate args:
>> +	 * - Only offset 0 support for now
>> +	 * - size should be PAGE_SIZE
>> +	 */
>> +	if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
>> +		retval = -EINVAL;
>> +		goto error;
>> +	}
>> +
>> +	/*
>> +	 * Only one mmap allowed at a time
>> +	 */
>> +	if (current->sched_preempt_delay.mmap_state != NULL) {
>> +		retval = -EEXIST;
>> +		goto error;
>
> This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
> but what if the task opens /proc/random_tid/sched_preempt_delay ?

Good point. A thread should not be allowed to request a preemption delay
for another thread. I would recommend leaving this code alone and adding
the following code before this:

if (get_proc_task(inode) != current) {
	retval = -EPERM;
	goto error;
}

Sounds reasonable?

>
>> +	state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
>> +	kaddr = vmalloc_user(PAGE_SIZE);
>
> Why vmalloc() ? We only need a single page?
>

Makes sense. I will switch to get_zeroed_page().
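
For illustration, the allocation in tid_preempt_delay_mmap() could then
look roughly like this untested sketch (not part of the posted patch;
the error and exit paths would use free_page() on kaddr instead of
vfree()):

	/* one zeroed page for the flag, instead of vmalloc_user(PAGE_SIZE) */
	kaddr = (void *)get_zeroed_page(GFP_KERNEL);
	if (!state || !kaddr) {
		retval = -ENOMEM;
		goto error;
	}
	page = virt_to_page(kaddr);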

>> +	task = get_proc_task(inode);
>
> And it seems that nobody does put_task_struct(state->task);

Good catch. I had caught the other two instances of get_proc_task() but 
missed this one.

>
>> +	state->page = page;
>> +	state->kaddr = kaddr;
>> +	state->uaddr = (void *)vma->vm_start;
>
> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>
>
> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
> this again... Perhaps I missed something, but I bet this all can't be
> right.

Would you say the sys_munmap() of mmap_state->uaddr is not even needed,
since exit_mm() will do this anyway further down in do_exit()? If I were
to remove this sys_munmap(), that could simplify the race issues as well.

>
>> +	state->task = task;
>> +
>> +	/* Clear the current delay request flag */
>> +	task->sched_preempt_delay.delay_flag = 0;
>> +
>> +	/* Point delay request flag pointer to the newly allocated memory */
>> +	task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
>> +
>> +	task->sched_preempt_delay.mmap_state = state;
>> +	vma->vm_private_data = state;
>> +	vma->vm_ops = &preempt_delay_vmops;
>> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>
> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).

Yes, you are right. I will add that.

> VM_SHARED/VM_WRITE doesn't look right.

VM_SHARED is wrong, but VM_WRITE is needed I think, since the thread will
write to the mmap'd page to request a preemption delay.

>
> Oleg.
>

I appreciate your taking the time to review this code. Thank you very much.

--
Khalid


* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 17:44   ` Khalid Aziz
@ 2014-03-04 18:38     ` Al Viro
  2014-03-04 19:01       ` Khalid Aziz
  2014-03-04 19:03     ` Oleg Nesterov
  1 sibling, 1 reply; 67+ messages in thread
From: Al Viro @ 2014-03-04 18:38 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Oleg Nesterov, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
	venki, linux-kernel

On Tue, Mar 04, 2014 at 10:44:54AM -0700, Khalid Aziz wrote:

> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr
> and mmap_state. mmap_state should not be NULL after unmap. vfree()
> and kfree() are tolerant of pointers that have already been freed.

Huh?  Double free() is a bug, plain and simple.  Never do that - not
in userland and especially not in the kernel.  Think what happens if
some code gets executed between those two and asks to allocate something.
If it gets the area you'd just freed, your second free will leave it
with all kinds of nasty surprises.  Starting with "who the hell has
started to modify the object I'd allocated and hadn't freed?"

A:	p = alloc();
A:	free(p);
B:	q = alloc();	/* q == p now */
B:	*q = 0;		/* *q is zero */
A:	free(p);	/* same as free(q) */
C:	r = alloc();	/* r == q now */
C:	*r = 1;		/* *q is one */
B:	if (*q != 0) panic("somebody's buggering my memory");

It's always a bug, whether the implementation catches it or not.


* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 18:38     ` Al Viro
@ 2014-03-04 19:01       ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 19:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Oleg Nesterov, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
	venki, linux-kernel

On Tue, 2014-03-04 at 18:38 +0000, Al Viro wrote:
> On Tue, Mar 04, 2014 at 10:44:54AM -0700, Khalid Aziz wrote:
> 
> > do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr
> > and mmap_state. mmap_state should not be NULL after unmap. vfree()
> > and kfree() are tolerant of pointers that have already been freed.
> 
> Huh?  Double free() is a bug, plain and simple.  Never do that - not
> in userland and especially not in the kernel.  Think what happens if
> some code gets executed between those two and asks to allocate something.
> If it gets the area you'd just freed, your second free will leave it
> with all kinds of nasty surprises.  Starting with "who the hell has
> started to modify the object I'd allocated and hadn't freed?"
> 
> A:	p = alloc();
> A:	free(p);
> B:	q = alloc();	/* q == p now */
> B:	*q = 0;		/* *q is zero */
> A:	free(p);	/* same as free(q) */
> C:	r = alloc();	/* r == q now */
> C:	*r = 1;		/* *q is one */
> B:	if (*q != 0) panic("somebody's buggering my memory");
> 
> It's always a bug, whether the implementation catches it or not.

Agreed, you are right. I will fix it.

--
Khalid





* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 17:44   ` Khalid Aziz
  2014-03-04 18:38     ` Al Viro
@ 2014-03-04 19:03     ` Oleg Nesterov
  2014-03-04 20:14       ` Khalid Aziz
  1 sibling, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-04 19:03 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/04, Khalid Aziz wrote:
>
> On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
>> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
>> should be NULL ?? Perhaps I misread this patch completely...
>
> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and
> mmap_state. mmap_state should not be NULL after unmap.

Can't understand... do_exit() does:

	+#if CONFIG_SCHED_PREEMPT_DELAY
	+       if (tsk->sched_preempt_delay.mmap_state) {
	+               sys_munmap((unsigned long)
	+                       tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
	+               vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
	+               kfree(tsk->sched_preempt_delay.mmap_state);
	
sys_munmap() (which btw should not be used) obviously unmaps that
vma and vm_ops->close() should be called.

close_preempt_delay_vmops() does:

	state->task->sched_preempt_delay.mmap_state = NULL;

vfree(tsk->sched_preempt_delay.mmap_state->kaddr) above will try to
dereference .mmap_state == NULL.

IOW, I think that with this patch this trivial program

	int main(void)
	{
		fd = open("/proc/self/task/$TID/sched_preempt_delay", O_RDWR);
		mmap(NULL, 4096, PROT_READ,MAP_SHARED, fd, 0);
		return 0;
	}

should crash the kernel.

>>> +	if (current->sched_preempt_delay.mmap_state != NULL) {
>>> +		retval = -EEXIST;
>>> +		goto error;
>>
>> This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
>> but what if the task opens /proc/random_tid/sched_preempt_delay ?
>
> Good point. A thread should not be allowed to request preemption delay
> for another thread. I would recommend leaving this code alone and adding
> following code before this:
>
> if (get_proc_task(inode) != current) {
> 	retval = -EPERM;
> 	goto error;
> }
>
> Sounds reasonable?

Yes, we should check == current, but this interface looks strange anyway, imho.

>>> +	state->page = page;
>>> +	state->kaddr = kaddr;
>>> +	state->uaddr = (void *)vma->vm_start;
>>
>> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>>
>>
>> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
>> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
>> this again... Perhaps I missed something, but I bet this all can't be
>> right.
>
> Would you say sys_munmap() of mmap_state->uaddr is not even needed since
> exit_mm() will do this any way further down in do_exit()?

No.

I meant:

	1. mremap() can move this vma, so do_exit() can't trust ->uaddr

	2. Even worse, mremap() itself is not safe. It can do ->close()
	   too and create the new vma with the same vm_ops. Another
	   unmap from (say) exit_mm() won't be happy.

>>> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>>
>> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
>
> Yes, you are right. I will add that.
>
>> VM_SHARED/VM_WRITE doesn't look right.
>
> VM_SHARED is wrong but VM_WRITE is needed I think since the thread will
> write to the mmap'd page to signal to request preemption delay.

But ->mmap() should not set VM_WRITE if application does mmap(PROT_READ) ?
VM_WRITE-or-not should be decided by do_mmap_pgoff/mprotect, ->mmap()
should not play with this bit.

Oleg.



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 19:03     ` Oleg Nesterov
@ 2014-03-04 20:14       ` Khalid Aziz
  2014-03-05 14:38         ` Oleg Nesterov
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 20:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/04/2014 12:03 PM, Oleg Nesterov wrote:
> On 03/04, Khalid Aziz wrote:
>>
>> On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
>>> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
>>> should be NULL ?? Perhaps I misread this patch completely...
>>
>> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and
>> mmap_state. mmap_state should not be NULL after unmap.
>
> Can't understand... do_exit() does:
>
> 	+#if CONFIG_SCHED_PREEMPT_DELAY
> 	+       if (tsk->sched_preempt_delay.mmap_state) {
> 	+               sys_munmap((unsigned long)
> 	+                       tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
> 	+               vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
> 	+               kfree(tsk->sched_preempt_delay.mmap_state);
> 	
> sys_munmap() (which btw should not be used) obviously unmaps that
> vma and vma_ops()->close() should be called.
>
> close_preempt_delay_vmops() does:
>
> 	state->task->sched_preempt_delay.mmap_state = NULL;
>
> vfree(tsk->sched_preempt_delay.mmap_state->kaddr) above will try to
> dereference .mmap_state == NULL.
>
> IOW, I think that with this patch this trivial program
>
> 	int main(void)
> 	{
> 		fd = open("/proc/self/task/$TID/sched_preempt_delay", O_RDWR);
> 		mmap(NULL, 4096, PROT_READ,MAP_SHARED, fd, 0);
> 		return 0;
> 	}
>
> should crash the kernel.
>
>>>> +	state->page = page;
>>>> +	state->kaddr = kaddr;
>>>> +	state->uaddr = (void *)vma->vm_start;
>>>
>>> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>>>
>>>
>>> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
>>> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
>>> this again... Perhaps I missed something, but I bet this all can't be
>>> right.
>>
>> Would you say sys_munmap() of mmap_state->uaddr is not even needed since
>> exit_mm() will do this any way further down in do_exit()?
>
> No.
>
> I meant:
>
> 	1. mremap() can move this vma, so do_exit() can't trust ->uaddr
>
> 	2. Even worse, mremap() itself is not safe. It can do ->close()
> 	   too and create the new vma with the same vm_ops. Another
> 	   unmap from (say) exit_mm() won't be happy.

I agree this looks like a potential spot for trouble. I was asking whether
removing the sys_munmap() of uaddr from do_exit() solves both of the above
problems. You have convinced me the sys_munmap() I added is unnecessary.

>
>>>> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>>>
>>> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
>>
>> Yes, you are right. I will add that.
>>
>>> VM_SHARED/VM_WRITE doesn't look right.
>>
>> VM_SHARED is wrong but VM_WRITE is needed I think since the thread will
>> write to the mmap'd page to signal to request preemption delay.
>
> But ->mmap() should not set VM_WRITE if application does mmap(PROT_READ) ?
> VM_WRITE-or-not should be decided by do_mmap_pgoff/mprotect, ->mmap()
> should not play with this bit.
>

Ah, I see. This makes sense. I will remove it.

Thanks,
Khalid




* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
  2014-03-03 21:51 ` Davidlohr Bueso
  2014-03-04 13:56 ` Oleg Nesterov
@ 2014-03-04 21:12 ` H. Peter Anvin
  2014-03-04 21:39   ` Khalid Aziz
  2014-03-06 13:24   ` Rasmus Villemoes
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
  2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
  4 siblings, 2 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-04 21:12 UTC (permalink / raw)
  To: Khalid Aziz, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
	viro, oleg, venki
  Cc: linux-kernel

On 03/03/2014 10:07 AM, Khalid Aziz wrote:
> 
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.
> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.
> 
> I have implemented similar functionality for Linux. This patch adds a
> file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
> Writing 1 to this file causes CFS scheduler to grant additional time
> slice if the currently running process comes up for pre-emption. Writing
> to this file needs to be very quick operation, so I have implemented
> code to allow mmap'ing /proc/<tgid>/task/<tid>/sched_preempt_delay. This
> allows a userspace task to write this flag very quickly. Usage model is
> a thread mmaps this file during initialization. It then writes a 1 to
> the mmap'd file after it grabs the lock in its critical section where it
> wants immunity from pre-emption. It then writes 0 again to this file
> after it releases the lock and calls sched_yield() to give up the
> processor. I have also added a new field in scheduler statistics -
> nr_preempt_delayed, that counts the number of times a thread has been
> granted amnesty. Further details on using this functionality are in 
> Documentation/scheduler/sched-preempt-delay.txt in the patch. This
> patch is based upon the work Venkatesh Pallipadi had done couple of
> years ago.
> 

Shades of the Android wakelocks, no?

This seems to effectively give userspace an option to turn preemptive
multitasking into cooperative multitasking, which of course is
unacceptable for an unprivileged process (the same reason why unprivileged
processes aren't allowed to run at above-normal priority, including RT
priority.)

I have several issues with this interface:

1. First, a process needs to know if it *should* have been preempted
before it calls sched_yield().  So there needs to be a second flag set
by the scheduler when granting amnesty.

2. A process which fails to call sched_yield() after being granted
amnesty must be penalized.

3. I'm not keen on occupying a full page for this.  I'm wondering if
doing a pointer into user space, futex-style, might make more sense.
The downside, of course, is what happens if the page being pointed to is
swapped out.

Keep in mind this HAS to be per thread.

	-hpa




* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 21:12 ` H. Peter Anvin
@ 2014-03-04 21:39   ` Khalid Aziz
  2014-03-04 22:23     ` One Thousand Gnomes
  2014-03-06 13:24   ` Rasmus Villemoes
  1 sibling, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 21:39 UTC (permalink / raw)
  To: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
	viro, oleg
  Cc: linux-kernel

On 03/04/2014 02:12 PM, H. Peter Anvin wrote:
>
> Shades of the Android wakelocks, no?
>
> This seems to effectively give userspace an option to turn preemptive
> multitasking into cooperative multitasking, which of course is
> unacceptable for a privileged process (the same reason why unprivileged
> processes aren't allowed to run at above-normal priority, including RT
> priority.)
>
> I have several issues with this interface:
>
> 1. First, a process needs to know if it *should* have been preempted
> before it calls sched_yield().  So there needs to be a second flag set
> by the scheduler when granting amnesty.

Good idea. I like it. I will add it.
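
For illustration, with a hypothetical second word in the mapped page that
the scheduler sets when it grants a delay (the layout and the map[1] slot
are made up for this sketch), the userspace side could then look roughly
like:

	map[0] = 1;		/* request pre-emption delay */
	/* ... critical section ... */
	map[0] = 0;		/* drop the request */
	if (map[1]) {		/* hypothetical "delay was granted" flag */
		map[1] = 0;
		sched_yield();	/* give back the extra timeslice we used */
	}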

>
> 2. A process which fails to call sched_yield() after being granted
> amnesty must be penalized.

I agree. Is it fair to say that such a process already sees the penalty
by being charged that extra timeslice and being pushed to the right side
of the RB tree, since its p->se.vruntime would have gone up, which then
delays the time when it can get the CPU again? I am open to adding a more
explicit penalty - maybe deny its next preemption delay request if it
failed to call sched_yield() the last time when it should have?

>
> 3. I'm not keen on occupying a full page for this.  I'm wondering if
> doing a pointer into user space, futex-style, might make more sense.
> The downside, of course, is what happens if the page being pointed to is
> swapped out.

Using a full page for what is effectively a single-bit flag does not sit
well with me either. Doing it through proc forces a minimum size of a
page (please correct me there if I am wrong). I will explore your idea
some more to see if that can be made to work.

>
> Keep in mind this HAS to be per thread.
>

Thanks, hpa!

--
Khalid



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 21:39   ` Khalid Aziz
@ 2014-03-04 22:23     ` One Thousand Gnomes
  2014-03-04 22:44       ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: One Thousand Gnomes @ 2014-03-04 22:23 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
	viro, oleg, linux-kernel

Obvious bug

| Usage model is a thread mmaps this file during initialization. It then
| writes a 1 to the mmap'd file after it grabs the lock in its critical
| section where it wants immunity from pre-emption.

You need to write it first or you can be pre-empted taking the lock
before asking for immunity.

Presumably you could equally use something to group tasks (say a control
group of some form) and implement voluntary pre-emption within the group
only when in user space. Ie they only pre-empt each other by yielding but
they can be pre-empted by other tasks outside the group ?



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 22:23     ` One Thousand Gnomes
@ 2014-03-04 22:44       ` Khalid Aziz
  2014-03-05  0:39         ` Thomas Gleixner
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 22:44 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
	viro, oleg, linux-kernel

On 03/04/2014 03:23 PM, One Thousand Gnomes wrote:
> Obvious bug
>
> | Usage model is a thread mmaps this file during initialization. It then
> | writes a 1 to the mmap'd file after it grabs the lock in its critical
> | section where it wants immunity from pre-emption.
>
> You need to write it first or you can be pre-empted taking the lock
> before asking for immunity.

Ah, yes. Thanks, Alan!
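
So the flag write has to move in front of the lock acquisition, roughly
like this (sketch only, assuming a pthread mutex for the example):

	map[0] = 1;			/* ask for immunity *before* taking the lock */
	pthread_mutex_lock(&lock);
	/* ... critical section ... */
	pthread_mutex_unlock(&lock);
	map[0] = 0;			/* drop the request after unlocking */
	sched_yield();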

>
> Presumably you could equally use something to group tasks (say a control
> group of some form) and implement voluntary pre-emption within the group
> only when in user space. Ie they only pre-empt each other by yielding but
> they can be pre-empted by other tasks outside the group ?
>

I had suggested that to the database folks initially, but they use
mutexes that are shared across processes and they do not believe they
can control the environment in a customer scenario well enough to ensure
the right tasks will always be in the right control group. Besides, they
want to use a common mechanism across multiple OSs, and pre-emption
delay is already in use on other OSs. Good idea though.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 22:44       ` Khalid Aziz
@ 2014-03-05  0:39         ` Thomas Gleixner
  2014-03-05  0:51           ` Andi Kleen
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-05  0:39 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: One Thousand Gnomes, H. Peter Anvin, Ingo Molnar, peterz, akpm,
	andi.kleen, rob, viro, oleg, linux-kernel

On Tue, 4 Mar 2014, Khalid Aziz wrote:
> be in the right control group. Besides they want to use a common mechanism
> across multiple OSs and pre-emption delay is already in use on other OSs. Good
> idea though.

Well, just because preemption delay is a mechanism exposed by some
other OS does not make it a good idea.

In fact it's a horrible idea.

What you are creating is a crystal-ball-based form of time-bound
priority ceiling with the worst user space interface I've ever seen.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05  0:39         ` Thomas Gleixner
@ 2014-03-05  0:51           ` Andi Kleen
  2014-03-05 11:10             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-05  0:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Khalid Aziz, One Thousand Gnomes, H. Peter Anvin, Ingo Molnar,
	peterz, akpm, viro, oleg, linux-kernel

Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, 4 Mar 2014, Khalid Aziz wrote:
>> be in the right control group. Besides they want to use a common mechanism
>> across multiple OSs and pre-emption delay is already in use on other OSs. Good
>> idea though.
>
> Well, just because preemption delay is a mechanism exposed by some
> other OS does not make it a good idea.
>
> In fact it's a horrible idea.
>
> What you are creating is a crystal-ball-based form of time-bound
> priority ceiling with the worst user space interface I've ever seen.

So how would you solve the user space lock holder preemption 
problem then?

It's a real problem, affecting lots of workloads.

Just saying everything is crap without suggesting anything 
constructive is not really getting us anywhere.

The workarounds I've seen for it are generally far worse
than this. Often people do all kinds of fragile tuning
to address this, which then breaks on the next kernel
update that makes even a minor scheduler change.

futex doesn't solve the problem at all.

The real time scheduler is also a really poor fit for these 
workloads and needs a lot of hacks to scale.

The thread swap proposal from plumbers had some potential,
but it's likely very intrusive everywhere and seems
to have died too.

Anything else?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05  0:51           ` Andi Kleen
@ 2014-03-05 11:10             ` Peter Zijlstra
  2014-03-05 17:29               ` Khalid Aziz
  2014-03-05 19:58               ` Khalid Aziz
  2014-03-05 14:54             ` Oleg Nesterov
  2014-03-06 12:13             ` Kevin Easton
  2 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-05 11:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, akpm, viro, oleg, linux-kernel

On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> Anything else?

Proxy execution; its a form of PI that works for arbitrary scheduling
policies (thus also very much including fair).

With that what you effectively end up with is the lock holder running
'boosted' by the runtime of its blocked chain until the entire chain
runs out of time, at which point preemption doesn't matter anyhow.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 20:14       ` Khalid Aziz
@ 2014-03-05 14:38         ` Oleg Nesterov
  2014-03-05 16:12           ` Oleg Nesterov
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 14:38 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/04, Khalid Aziz wrote:
>
> On 03/04/2014 12:03 PM, Oleg Nesterov wrote:
>>
>> 	1. mremap() can move this vma, so do_exit() can't trust ->uaddr
>>
>> 	2. Even worse, mremap() itself is not safe. It can do ->close()
>> 	   too and create the new vma with the same vm_ops. Another
>> 	   unmap from (say) exit_mm() won't be happy.
>
> I agree this looks like a potential spot for trouble. I was asking if
> removing sys_munmap() of uaddr from do_exit() solves both of the above
> problems?

How? Of course this won't solve the problems. And there are more problems.

> You have convinced me this sys_munmap() I added is unnecessary.

Cough ;) I didn't try to convince you that it should be removed. It is
necessary (but of course you should not use sys_munmap(); say, vm_munmap()
is better).

But you know, I think this all doesn't matter. Imho, this proc/mmap
interface is horrible. Perhaps something like a TLS area visible to the
kernel makes sense, but it should be more generic at least.

You added /proc/sched_preempt_delay to avoid the syscall. I think it
would be better to simply add vdso_sched_preempt_delay() instead.


But the main problem, of course, is that this feature is questionable.

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05  0:51           ` Andi Kleen
  2014-03-05 11:10             ` Peter Zijlstra
@ 2014-03-05 14:54             ` Oleg Nesterov
  2014-03-05 15:56               ` Andi Kleen
  2014-03-06 12:13             ` Kevin Easton
  2 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 14:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/04, Andi Kleen wrote:
>
> Anything else?

Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
Or perhaps sys_futex() can do this if it knows the owner. Don't ask
me what exactly I mean though ;)

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 14:54             ` Oleg Nesterov
@ 2014-03-05 15:56               ` Andi Kleen
  2014-03-05 16:36                 ` Oleg Nesterov
  0 siblings, 1 reply; 67+ messages in thread
From: Andi Kleen @ 2014-03-05 15:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
> On 03/04, Andi Kleen wrote:
> >
> > Anything else?
> 
> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
> me what exactly I mean though ;)

You mean yield_to() would extend the time slice?

That would be the same as the mmap page, just with a syscall right?

-andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 14:38         ` Oleg Nesterov
@ 2014-03-05 16:12           ` Oleg Nesterov
  2014-03-05 17:10             ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 16:12 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/05, Oleg Nesterov wrote:
>
> You added /proc/sched_preempt_delay to avoid the syscall. I think it
> would be better to simply add vdso_sched_preempt_delay() instead.

I am stupid. vdso_sched_preempt_delay() obviously can't write to, say,
task_struct.

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 15:56               ` Andi Kleen
@ 2014-03-05 16:36                 ` Oleg Nesterov
  2014-03-05 17:22                   ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 16:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05, Andi Kleen wrote:
>
> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
> > On 03/04, Andi Kleen wrote:
> > >
> > > Anything else?
> >
> > Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
> > Or perhaps sys_futex() can do this if it knows the owner. Don't ask
> > me what exactly I mean though ;)
>
> You mean yield_to() would extend the time slice?
>
> That would be the same as the mmap page, just with a syscall right?

Not the same. Very roughly I meant something like

	my_lock()
	{
		if (!TRY_LOCK()) {
			yield_to(owner);
			LOCK();
		}

		owner = gettid();
	}

But once again, I am not sure if this makes any sense.

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 16:12           ` Oleg Nesterov
@ 2014-03-05 17:10             ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
	linux-kernel

On 03/05/2014 09:12 AM, Oleg Nesterov wrote:
> On 03/05, Oleg Nesterov wrote:
>>
>> You added /proc/sched_preempt_delay to avoid the syscall. I think it
>> would be better to simply add vdso_sched_preempt_delay() instead.
>
> I am stupid. vdso_sched_preempt_delay() obviously can't write to, say,
> task_struct.
>
> Oleg.
>

That is easy to miss :) I had looked into vdso when I started but found 
out I can't use vdso for this.

--
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 16:36                 ` Oleg Nesterov
@ 2014-03-05 17:22                   ` Khalid Aziz
  2014-03-05 23:13                     ` David Lang
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:22 UTC (permalink / raw)
  To: Oleg Nesterov, Andi Kleen
  Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 09:36 AM, Oleg Nesterov wrote:
> On 03/05, Andi Kleen wrote:
>>
>> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
>>> On 03/04, Andi Kleen wrote:
>>>>
>>>> Anything else?
>>>
>>> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
>>> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
>>> me what exactly I mean though ;)
>>
>> You mean yield_to() would extend the time slice?
>>
>> That would be the same as the mmap page, just with a syscall right?
>
> Not the same. Very roughly I meant something like
>
> 	my_lock()
> 	{
> 		if (!TRY_LOCK()) {
> 			yield_to(owner);
> 			LOCK();
> 		}
>
> 		owner = gettid();
> 	}
>
> But once again, I am not sure if this makes any sense.
>
> Oleg.
>

The trouble with that approach is that by the time a thread finds out it 
cannot acquire the lock because someone else has it, we have already paid 
the price of a context switch. What I am trying to do is to avoid that 
cost. I looked into a few other approaches to solving this problem 
without making kernel changes:

- Use the PTHREAD_PRIO_PROTECT protocol to boost the priority of the 
thread that holds the lock, to minimize contention and the CPU cycles 
wasted by other threads only to find out someone already has the lock. 
The problem I ran into is that the implementation of PTHREAD_PRIO_PROTECT 
requires another system call, sched_setscheduler(), inside the library to 
boost priority. Now I have added the overhead of a new system call which 
easily outweighs any performance gains from removing lock contention. 
Besides, databases implement their own spinlocks to maximize performance 
and thus cannot use PTHREAD_PRIO_PROTECT in the POSIX threads library 
(see the setup sketch after this list).

- I looked into the adaptive spinning futex work Darren Hart was working 
on. It looked very promising, but I ran into the same problem again. It 
reduces the cost of contention by delaying context switches in cases 
where spinning is quicker, but it still does not do anything to reduce 
the cost of the context switch for a thread to get the CPU only to find 
out it cannot get the lock. This cost again outweighs the 3%-5% benefit 
we are seeing from just not giving up the CPU in the middle of a critical 
section.
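
For reference, the PTHREAD_PRIO_PROTECT setup I am referring to looks
roughly like this (names are illustrative; the boost to the ceiling
priority on every lock/unlock is where the extra system call inside the
library comes from):

	pthread_mutexattr_t attr;
	pthread_mutex_t db_lock;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
	pthread_mutexattr_setprioceiling(&attr, ceiling);	/* app-chosen ceiling */
	pthread_mutex_init(&db_lock, &attr);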

Makes sense?

--
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 11:10             ` Peter Zijlstra
@ 2014-03-05 17:29               ` Khalid Aziz
  2014-03-05 19:58               ` Khalid Aziz
  1 sibling, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:29 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, akpm, viro, oleg, linux-kernel

On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> Proxy execution; its a form of PI that works for arbitrary scheduling
> policies (thus also very much including fair).
>
> With that what you effectively end up with is the lock holder running
> 'boosted' by the runtime of its blocked chain until the entire chain
> runs out of time, at which point preemption doesn't matter anyhow.
>

I assume you are referring to 
http://users.soe.ucsc.edu/~jayhawk/watkins-rtlws09.pdf and/or 
http://www.osadl.org/fileadmin/dam/presentations/RTLWS11/peterz-edf-why-not.pdf 
? I will read through these and see if this can solve the problem.

--
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 11:10             ` Peter Zijlstra
  2014-03-05 17:29               ` Khalid Aziz
@ 2014-03-05 19:58               ` Khalid Aziz
  2014-03-06  9:57                 ` Peter Zijlstra
  2014-03-06 11:14                 ` Thomas Gleixner
  1 sibling, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 19:58 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, akpm, viro, oleg, linux-kernel

On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> Proxy execution; its a form of PI that works for arbitrary scheduling
> policies (thus also very much including fair).
>
> With that what you effectively end up with is the lock holder running
> 'boosted' by the runtime of its blocked chain until the entire chain
> runs out of time, at which point preemption doesn't matter anyhow.
>

Hello Peter,

I read through the concept of proxy execution and it is a very 
interesting concept. I come from many years of realtime and embeddded 
systems development and I can easily recall various problems in the past 
that can be solved or helped by this. Looking at the current problem I 
am trying to solve with databases and JVM, I run into the same issue I 
described in my earlier email. Proxy execution is a post-contention 
solution. By the time proxy execution can do something for my case, I 
have already paid the price of contention and a context switch which is 
what I am trying to avoid. For a critical section that is very short 
compared to the size of execution thread, which is the case I am looking 
at, avoiding preemption in the middle of that short critical section 
helps much more than dealing with lock contention later on. The goal 
here is to avoid lock contention and associated cost. I do understand 
the cost of dealing with lock contention poorly and that can easily be 
much bigger cost, but I am looking into avoiding even getting there.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 17:22                   ` Khalid Aziz
@ 2014-03-05 23:13                     ` David Lang
  2014-03-05 23:48                       ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-05 23:13 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On Wed, 5 Mar 2014, Khalid Aziz wrote:

> On 03/05/2014 09:36 AM, Oleg Nesterov wrote:
>> On 03/05, Andi Kleen wrote:
>>> 
>>> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
>>>> On 03/04, Andi Kleen wrote:
>>>>> 
>>>>> Anything else?
>>>> 
>>>> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
>>>> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
>>>> me what exactly I mean though ;)
>>> 
>>> You mean yield_to() would extend the time slice?
>>> 
>>> That would be the same as the mmap page, just with a syscall right?
>> 
>> Not the same. Very roughly I meant something like
>>
>> 	my_lock()
>> 	{
>> 		if (!TRY_LOCK()) {
>> 			yield_to(owner);
>> 			LOCK();
>> 		}
>>
>> 		owner = gettid();
>> 	}
>> 
>> But once again, I am not sure if this makes any sense.
>> 
>> Oleg.
>> 
>
> Trouble with that approach is by the time a thread finds out it can not 
> acquire the lock because someone else has it, we have already paid the price 
> of context switch. What I am trying to do is to avoid that cost. I looked 
> into a few other approaches to solving this problem without making kernel 
> changes:

Yes, you have paid the cost of the context switch, but your original problem 
description talked about having multiple other threads trying to get the lock, 
then spinning trying to get the lock (wasting time if the process holding it is 
asleep, but not if it's running on another core) and causing a long delay before 
the process holding the lock gets a chance to run again.

Having the threads immediately yield to the process that has the lock reduces 
this down to two context switches, which isn't perfect, but it's a LOT better 
than what you started from.

> - Use PTHREAD_PRIO_PROTECT protocol to boost the priority of thread that 
> holds the lock to minimize contention and CPU cycles wasted by other threads 
> only to find out someone already has the lock. Problem I ran into is the 
> implementation of PTHREAD_PRIO_PROTECT requires another system call, 
> sched_setscheduler(), inside the library to boost priority. Now I have added 
> the overhead of a new system call which easily outweighs any performance 
> gains from removing lock contention. Besides databases implement their own 
> spinlocks to maximize performance and thus can not use PTHREAD_PRIO_PROTECT 
> in posix threads library.

well, writing to something in /proc isn't free either. And how is the thread 
supposed to know if it needs to do so or if it's going to have enough time to 
finish it's work before it's out of time (how can it know how much time it would 
have left anyway?)

> - I looked into adaptive spinning futex work Darren Hart was working on. It 
> looked very promising but I ran into the same problem again. It reduces the 
> cost of contention by delaying context switches in cases where spinning is 
> quicker but it still does not do anything to reduce the cost of context 
> switch for a thread to get the CPU only to find out it can not get the lock. 
> This cost again outweighs the 3%-5% benefit we are seeing from just not 
> giving up CPU in the middle of critical section.

is this gain from not giving up the CPU at all? or is it from avoiding all the 
delays due to the contending thread trying in turn? the yield_to() approach 
avoids all those other threads trying in turn so it should get fairly close to 
the same benefits.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 23:13                     ` David Lang
@ 2014-03-05 23:48                       ` Khalid Aziz
  2014-03-05 23:56                         ` H. Peter Anvin
  2014-03-05 23:59                         ` David Lang
  0 siblings, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 23:48 UTC (permalink / raw)
  To: David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 04:13 PM, David Lang wrote:
> Yes, you have paid the cost of the context switch, but your original
> problem description talked about having multiple other threads trying to
> get the lock, then spinning trying to get the lock (wasting time if the
> process holding it is asleep, but not if it's running on another core)
> and causing a long delay before the process holding the lock gets a
> chance to run again.
>
> Having the threads immediately yield to the process that has the lock
> reduces this down to two context switches, which isn't perfect, but it's
> a LOT better than what you started from.

OK, let us consider the multiple core scenario:

- Thread A gets scheduled on core 1, runs for 95% of its timeslice 
before it gets to its critical section.
- Thread A grabs the lock and quickly reaches the end of its timeslice 
before finishing its critical section.
- Thread A is preempted on core 1 by a completely unrelated thread.
- Thread B in the meantime is scheduled on core 2 and happens to get to 
its critical section right away, where it tries to grab the lock held by 
thread A. It spins for a bit waiting to see if the lock becomes 
available, gives up and yields to the next process in the queue.
- Since thread A ran recently, it is now stuck towards the end of the run 
queue, so thread C gets to run on core 2, which goes through the same 
fate as thread A.

Now scale this scenario across more cores and more threads that all want 
the same lock to execute their small critical section. This cost of 
spinning and context switching could have been avoided if thread A could 
get an additional timeslice to complete its critical section. Yielding to 
the process holding the lock happens only after contention has happened 
and we have paid the price of two context switches for the lock owner. 
When yield_to() happens, the lock owner may not get to run on the core 
it was on, because an unrelated thread is running on that core and it 
needs to wait for that thread's timeslice to run out. If the lock owner 
gets scheduled on another core, we pay the price of repopulating the 
cache for a new thread on that core. yield_to() is better than having a 
convoy of processes build up waiting for the same lock, but worse than 
avoiding the contention altogether.

>
> well, writing to something in /proc isn't free either. And how is the
> thread supposed to know if it needs to do so or if it's going to have
> enough time to finish it's work before it's out of time (how can it know
> how much time it would have left anyway?)

The cost is a write to a memory location, since the thread is using 
mmap; not insignificant, but hardly expensive. The thread does not need 
to know how much time it has left in its current timeslice. It always 
sets the flag to request pre-emption immunity before entering the 
critical section and clears the flag when it exits its critical section. 
If the thread comes up for pre-emption while the flag is set, it gets 
immunity. If it does not, the flag will be cleared at the end of the 
critical section anyway.

> is this gain from not giving up the CPU at all? or is it from avoiding
> all the delays due to the contending thread trying in turn? the
> yield_to() approach avoids all those other threads trying in turn so it
> should get fairly close to the same benefits.
>

The gain is from avoiding contention by giving the locking thread a 
chance to complete its critical section, which is expected to be very 
short (certainly shorter than a timeslice). Pre-emption immunity gives it 
one and only one additional timeslice.

Hope this helps clear things up.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 23:48                       ` Khalid Aziz
@ 2014-03-05 23:56                         ` H. Peter Anvin
  2014-03-06  0:02                           ` Khalid Aziz
  2014-03-05 23:59                         ` David Lang
  1 sibling, 1 reply; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-05 23:56 UTC (permalink / raw)
  To: Khalid Aziz, David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 03:48 PM, Khalid Aziz wrote:
> 
> Cost is writing to a memory location since thread is using mmap, not
> insignificant but hardly expensive. Thread does not need to know how
> much time it has left in current timeslice. It always sets the flag to
> request pre-emption immunity before entering the critical section and
> clears the flag when it exits its critical section. If the thread comes
> up for pre-emption while the flag is set, it gets immunity. If it does
> not, flag will be cleared at the end of critical section any way.
> 

A little more than that.  The scheduler needs to set *another* flag
telling the process to yield upon leaving the critical section; if the
process doesn't, the scheduler needs to keep enough accounting to know
to penalize the process, or this method will not be usable for
unprivileged processes.
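
Roughly, the userspace side of that handshake would then look something
like this (flag names made up, both words living in the shared area):

	*preempt_delay = 1;		/* request immunity */
	LOCK();
	/* ... short critical section ... */
	UNLOCK();
	*preempt_delay = 0;
	if (*yield_request)		/* scheduler granted us amnesty */
		sched_yield();		/* skip this and get penalized */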

I have been thinking about how to deal with what happens if this isn't
locked in memory and gets paged out.  If that can be detected at
reasonable cost, it probably makes most sense to simply say that if the
pointed-to memory is paged out, we're not in a critical section.
	
	-hpa


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 23:48                       ` Khalid Aziz
  2014-03-05 23:56                         ` H. Peter Anvin
@ 2014-03-05 23:59                         ` David Lang
  2014-03-06  0:17                           ` Khalid Aziz
  1 sibling, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-05 23:59 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On Wed, 5 Mar 2014, Khalid Aziz wrote:

> On 03/05/2014 04:13 PM, David Lang wrote:
>> Yes, you have paid the cost of the context switch, but your original
>> problem description talked about having multiple other threads trying to
>> get the lock, then spinning trying to get the lock (wasting time if the
>> process holding it is asleep, but not if it's running on another core)
>> and causing a long delay before the process holding the lock gets a
>> chance to run again.
>> 
>> Having the threads immediately yield to the process that has the lock
>> reduces this down to two context switches, which isn't perfect, but it's
>> a LOT better than what you started from.
>
> OK, let us consider the multiple core scenario:
>
> - Thread A gets scheduled on core 1, runs for 95% of its timeslice before it 
> gets to its critical section.
> - Thread A grabs the lock and quickly reaches the end of its timeslice before 
> finishing its critical section.
> - Thread A is preempted on core 1 by a completely unrelated thread.
> - Thread B in the mean time is scheduled on core 2 and happens to get to its 
> critical section right away where it tries to grab the lock held by thread A. 
> It spins for a bit waiting to see if lock becomes available, gives up and 
> yields to next process in queue.
> - Since thread A ran recently, it is now stuck towards the end of run queue, 
> so thread C gets to run on core 2 which goes through same fate as thread A.
>
> Now scale this scenario across more cores and more threads that all want the 
> same lock to execute their small critical section. This cost of spinning and 
> context switch could have been avoided if thread A could get additional 
> timeslice to complete its critical section. Yielding to the process holding 
> the lock happens only after contention has happened and we have paid the 
> price of two context switches for the lock owner. When yield_to() happens, 
> the lock owner may not get to run on the core it was on because an unrelated 
> thread is running on that core and it needs to wait for that thread's 
> timeslice to run out. If the lock owner gets scheduled on another core, we 
> pay the price of repopulating the cache for a new thread on that core. 
> yield_to() is better than having a convoy building up of processes waiting 
> for the same lock but worse than avoiding the contention altogether.

with yield_to(), when thread B hits the contention, thread A would get more 
time and be able to complete immediately, so by the time thread C gets around 
to running, there is no longer any contention.

>> 
>> well, writing to something in /proc isn't free either. And how is the
>> thread supposed to know if it needs to do so or if it's going to have
>> enough time to finish it's work before it's out of time (how can it know
>> how much time it would have left anyway?)
>
> Cost is writing to a memory location since thread is using mmap, not 
> insignificant but hardly expensive. Thread does not need to know how much 
> time it has left in current timeslice. It always sets the flag to request 
> pre-emption immunity before entering the critical section and clears the flag 
> when it exits its critical section. If the thread comes up for pre-emption 
> while the flag is set, it gets immunity. If it does not, flag will be cleared 
> at the end of critical section any way.

what's the cost to setup mmap of this file in /proc. this is sounding like a lot 
of work.

>> is this gain from not giving up the CPU at all? or is it from avoiding
>> all the delays due to the contending thread trying in turn? the
>> yield_to() approach avoids all those other threads trying in turn so it
>> should get fairly close to the same benefits.
>> 
>
> The gain is from avoiding contention by giving locking thread a chance to 
> complete its critical section which is expected to be very short (certainly 
> shorter than timeslice). Pre-emption immunity gives it one and only one 
> additional timeslice.

but the yield_to() does almost the same thing, there is a small bump, but you 
don't have to wait for thread B to spin, thread C..ZZZ etc to spin before thread 
A can finish it's work. As soon as the second thread hits the critical section, 
thread A is going to be able to do more work (and hopefully finish)

> Hope this helps clear things up.

It doesn't sound like you and I are understanding how the yield_to() approach 
would work. I hope my comments have helped get us on the same page.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 23:56                         ` H. Peter Anvin
@ 2014-03-06  0:02                           ` Khalid Aziz
  2014-03-06  0:13                             ` H. Peter Anvin
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06  0:02 UTC (permalink / raw)
  To: H. Peter Anvin, David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 04:56 PM, H. Peter Anvin wrote:
> On 03/05/2014 03:48 PM, Khalid Aziz wrote:
>>
>> Cost is writing to a memory location since thread is using mmap, not
>> insignificant but hardly expensive. Thread does not need to know how
>> much time it has left in current timeslice. It always sets the flag to
>> request pre-emption immunity before entering the critical section and
>> clears the flag when it exits its critical section. If the thread comes
>> up for pre-emption while the flag is set, it gets immunity. If it does
>> not, flag will be cleared at the end of critical section any way.
>>
>
> A little more than that.  The scheduler needs to set *another* flag
> telling the process to yield upon leaving the critical section; if the
> process doesn't, the scheduler needs to keep enough accounting to know
> to penalize the process, or this method will not be usable for
> unprivileged processes.

Yes, you had made that suggestion earlier and I like it. It will be in 
the v2 patch. I am thinking of making the penalty the denial of the next 
preemption immunity request if a process fails to yield when it should 
have. Sounds good?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06  0:02                           ` Khalid Aziz
@ 2014-03-06  0:13                             ` H. Peter Anvin
  0 siblings, 0 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-06  0:13 UTC (permalink / raw)
  To: Khalid Aziz, David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 04:02 PM, Khalid Aziz wrote:
> 
> Yes, you had made that suggestion earlier and I like it. It will be in
> v2 patch. I am thinking of making the penalty be denial of next
> preemption immunity request if a process fails to yield when it should
> have. Sounds good?
> 

That is the minimum penalty possible.  Hopefully that is sufficient.

	-hpa



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 23:59                         ` David Lang
@ 2014-03-06  0:17                           ` Khalid Aziz
  2014-03-06  0:36                             ` David Lang
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06  0:17 UTC (permalink / raw)
  To: David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 04:59 PM, David Lang wrote:
> what's the cost to setup mmap of this file in /proc. this is sounding
> like a lot of work.

That is a one time cost paid when a thread initializes itself.

>
>>> is this gain from not giving up the CPU at all? or is it from avoiding
>>> all the delays due to the contending thread trying in turn? the
>>> yield_to() approach avoids all those other threads trying in turn so it
>>> should get fairly close to the same benefits.
>>>
>>
>> The gain is from avoiding contention by giving locking thread a chance
>> to complete its critical section which is expected to be very short
>> (certainly shorter than timeslice). Pre-emption immunity gives it one
>> and only one additional timeslice.
>
> but the yield_to() does almost the same thing, there is a small bump,
> but you don't have to wait for thread B to spin, thread C..ZZZ etc to
> spin before thread A can finish it's work. As soon as the second thread
> hits the critical section, thread A is going to be able to do more work
> (and hopefully finish)
>
>> Hope this helps clear things up.
>
> It doesn't sound like you and I are understanding how the yield_to()
> approach would work. I hope my comments have helped get us on the same
> page.
>

I apologize if I am being dense. My understanding of yield_to() is what 
Oleg had said in his reply earlier, so I will quote the example he gave:

	my_lock()
	{
		if (!TRY_LOCK()) {
			yield_to(owner);
			LOCK();
		}

		owner = gettid();
	}

If thread A had already lost the processor by the time thread B executes 
the above code, wouldn't we have paid the price of two context switches 
for thread A?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06  0:17                           ` Khalid Aziz
@ 2014-03-06  0:36                             ` David Lang
  2014-03-06  1:22                               ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-06  0:36 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On Wed, 5 Mar 2014, Khalid Aziz wrote:

> On 03/05/2014 04:59 PM, David Lang wrote:
>> what's the cost to setup mmap of this file in /proc. this is sounding
>> like a lot of work.
>
> That is a one time cost paid when a thread initializes itself.
>
>> 
>>>> is this gain from not giving up the CPU at all? or is it from avoiding
>>>> all the delays due to the contending thread trying in turn? the
>>>> yield_to() approach avoids all those other threads trying in turn so it
>>>> should get fairly close to the same benefits.
>>>> 
>>> 
>>> The gain is from avoiding contention by giving locking thread a chance
>>> to complete its critical section which is expected to be very short
>>> (certainly shorter than timeslice). Pre-emption immunity gives it one
>>> and only one additional timeslice.
>> 
>> but the yield_to() does almost the same thing, there is a small bump,
>> but you don't have to wait for thread B to spin, thread C..ZZZ etc to
>> spin before thread A can finish it's work. As soon as the second thread
>> hits the critical section, thread A is going to be able to do more work
>> (and hopefully finish)
>> 
>>> Hope this helps clear things up.
>> 
>> It doesn't sound like you and I are understanding how the yield_to()
>> approach would work. I hope my comments have helped get us on the same
>> page.
>> 
>
> I apologize if I am being dense. My understanding of yield_to() is what Oleg 
> had said in his reply earlier, so I will quote the example he gave:
>
> 	my_lock()
> 	{
> 		if (!TRY_LOCK()) {
> 			yield_to(owner);
> 			LOCK();
> 		}
>
> 		owner = gettid();
> 	}
>
> If thread A had already lost the processor by the time thread B executes 
> above code, wouldn't we have paid the price of two context switches for 
> thread A?

Yes, you pay for two context switches, but you don't pay for threads B..ZZZ all 
running (and potentially spinning) trying to acquire the lock before thread A is 
able to complete its work.

As soon as a second thread hits the contention, thread A gets time to finish.

It's not as 'good' [1] as thread A just working longer, but it's FAR better than 
thread A sleeping while every other thread runs and potentially tries to get the 
lock

[1] it wastes the context switches, but it avoids the overhead of figuring out 
if the thread needs to extend its time, and if its time was actually extended, 
and what penalty it should suffer the next time it runs....

I expect that in practice, it will be very close to the 'theoretical ideal' of 
"I'm doing something important, don't interrupt me", without all the potential 
abuse of such a flag.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06  0:36                             ` David Lang
@ 2014-03-06  1:22                               ` Khalid Aziz
  2014-03-06 14:23                                 ` David Lang
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06  1:22 UTC (permalink / raw)
  To: David Lang
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On 03/05/2014 05:36 PM, David Lang wrote:
> Yes, you pay for two context switches, but you don't pay for threads
> B..ZZZ all running (and potentially spinning) trying to acquire the lock
> before thread A is able to complete its work.
>

Ah, great. We are converging now.

> As soon as a second thread hits the contention, thread A gets time to
> finish.

Only as long as thread A could be scheduled immediately, which may or may 
not be the case depending upon what else is running on the core thread A 
last ran on and whether thread A needs to be migrated to another core.

>
> It's not as 'good' [1] as thread A just working longer,

and that is the exact spot where I am trying to improve performance.

> but it's FAR
> better than thread A sleeping while every other thread runs and
> potentially tries to get the lock

Absolutely. I agree with that.

>
> [1] it wastes the context switches, but it avoids the overhead of
> figuring out if the thread needs to extend it's time, and if it's time
> was actually extended, and what penalty it should suffer the next time
> it runs....

All of it can be done by setting and checking a couple of flags in 
task_struct. That is not insignificant, but hardly expensive. The logic 
is quite simple:

resched()
{
	........
	if (immunity) {
		if (!penalty) {
			immunity = 0;
			penalty = 1;
			-- skip context switch --
		}
		else {
			immunity = penalty = 0;
			-- do the context switch --
		}
	}
	.........
}

sched_yield()
{
	......
	penalty = 0;
	......
}

This simple logic will also work to defeat the obnoxious threads that 
keep setting the immunity request flag repeatedly within the same critical 
section to give themselves multiple extensions.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 19:58               ` Khalid Aziz
@ 2014-03-06  9:57                 ` Peter Zijlstra
  2014-03-06 16:08                   ` Khalid Aziz
  2014-03-06 11:14                 ` Thomas Gleixner
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06  9:57 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, akpm, viro, oleg, linux-kernel

On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
> On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> >On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> >>Anything else?
> >
> >Proxy execution; its a form of PI that works for arbitrary scheduling
> >policies (thus also very much including fair).
> >
> >With that what you effectively end up with is the lock holder running
> >'boosted' by the runtime of its blocked chain until the entire chain
> >runs out of time, at which point preemption doesn't matter anyhow.
> >
> 
> Hello Peter,
> 
> I read through the concept of proxy execution and it is a very interesting
> concept. I come from many years of realtime and embeddded systems
> development and I can easily recall various problems in the past that can be
> solved or helped by this. 

Yeah; there are a few nasty cases with PEP on SMP though (the reason we
don't already have it); the trivial implementation that works wonderfully
on UP ends up capable of running the same task on multiple CPUs -- which
is an obvious fail.

There are people working on this though, as the scheme also works well
for the recently added deadline scheduler.

> Looking at the current problem I am trying to
> solve with databases and JVM, I run into the same issue I described in my
> earlier email. Proxy execution is a post-contention solution. By the time
> proxy execution can do something for my case, I have already paid the price
> of contention and a context switch which is what I am trying to avoid. For a
> critical section that is very short compared to the size of execution
> thread, which is the case I am looking at, avoiding preemption in the middle
> of that short critical section helps much more than dealing with lock
> contention later on.

Like others have already stated, it's likely still cheaper than the
pile-up you get now. It might not be optimally fast, but it sure takes
out the worst case you have now.

> The goal here is to avoid lock contention and
> associated cost. I do understand the cost of dealing with lock contention
> poorly and that can easily be much bigger cost, but I am looking into
> avoiding even getting there.

The thing is, unless userspace is an RT program or practises the same
discipline to such an extent that it makes no practical difference,
there's always going to be a case where you fail to cover the entire
critical section, at which point you're back to your pile-up fail.

So while the limited preemption guard helps the best case, it doesn't
help the worst case at all.

So supposing we went with this now; you (or someone else) will come back
in a year's time and tell us that if we only just stretch this window a
little, their favourite workload will also benefit.

Where's the end of that?

And what about CONFIG_HZ; suppose you compile your kernel with HZ=100,
where 1 extra tick is a roomy 10ms, and that is sufficient. Then someone
compiles their kernel with HZ=1000, the extra tick shrinks to 1ms, and it
all comes apart.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05 19:58               ` Khalid Aziz
  2014-03-06  9:57                 ` Peter Zijlstra
@ 2014-03-06 11:14                 ` Thomas Gleixner
  2014-03-06 16:32                   ` Khalid Aziz
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 11:14 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, Andi Kleen, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, Andrew Morton, Al Viro, Oleg Nesterov, LKML

On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> > On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > > Anything else?
> > 
> > Proxy execution; its a form of PI that works for arbitrary scheduling
> > policies (thus also very much including fair).
> > 
> > With that what you effectively end up with is the lock holder running
> > 'boosted' by the runtime of its blocked chain until the entire chain
> > runs out of time, at which point preemption doesn't matter anyhow.
> > 
> 
> Hello Peter,
> 
> I read through the concept of proxy execution and it is a very interesting
> concept. I come from many years of realtime and embeddded systems development
> and I can easily recall various problems in the past that can be solved or
> helped by this. Looking at the current problem I am trying to solve with
> databases and JVM, I run into the same issue I described in my earlier email.
> Proxy execution is a post-contention solution. By the time proxy execution can
> do something for my case, I have already paid the price of contention and a
> context switch which is what I am trying to avoid. For a critical section that
> is very short compared to the size of execution thread, which is the case I am
> looking at, avoiding preemption in the middle of that short critical section
> helps much more than dealing with lock contention later on. The goal here is
> to avoid lock contention and associated cost. I do understand the cost of
> dealing with lock contention poorly and that can easily be much bigger cost,
> but I am looking into avoiding even getting there.

We understand that you want to avoid preemption in the first place and
not get into the contention handling case.

But, what you're trying to do is essentially creating an ABI which we
have to support and maintain forever. And that definitely is worth a
few serious questions.

Let's ignore the mm related issues for now as those can be solved. That's
the least of my worries.

Right now you are using this for a single use case with a well defined
environment, where all related threads reside in the same scheduling
class (FAIR). But that's one of a gazillion of use cases of Linux.

If we allow you to special-case your database workload then we have no
argument why we should not do the same thing for realtime workloads,
where the SCHED_FAIR housekeeping thread can briefly hold a lock to
access some important data shared with the SCHED_FIFO realtime
computation thread. Of course the RT people want to avoid the lock
contention as much as you do, just for different reasons.

Add SCHED_EDF, cgroups and hierarchical scheduling to the picture and
hell breaks loose.

Why? Simply because you applied the "everything is a nail, therefore I
use a hammer" approach. Understandable, because in your case (database
workload) almost everything is the same nail.

Thanks,

	tglx








      


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-05  0:51           ` Andi Kleen
  2014-03-05 11:10             ` Peter Zijlstra
  2014-03-05 14:54             ` Oleg Nesterov
@ 2014-03-06 12:13             ` Kevin Easton
  2014-03-06 13:59               ` Peter Zijlstra
  2014-03-06 14:25               ` David Lang
  2 siblings, 2 replies; 67+ messages in thread
From: Kevin Easton @ 2014-03-06 12:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, oleg,
	linux-kernel

On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> Anything else?

If it was possible to make the time remaining in the current timeslice
available to userspace through the vdso, the thread could do something like:

if (sys_timeleft() < CRITICAL_SECTION_SIZE)
    yield();
lock();

to avoid running out of timeslice in the middle of the critical section.

    - Kevin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-04 21:12 ` H. Peter Anvin
  2014-03-04 21:39   ` Khalid Aziz
@ 2014-03-06 13:24   ` Rasmus Villemoes
  2014-03-06 13:34     ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Rasmus Villemoes @ 2014-03-06 13:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Khalid Aziz, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
	viro, oleg, venki, linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:

> I have several issues with this interface:
>
> 1. First, a process needs to know if it *should* have been preempted
> before it calls sched_yield().  So there needs to be a second flag set
> by the scheduler when granting amnesty.
>
> 2. A process which fails to call sched_yield() after being granted
> amnesty must be penalized.
>
> 3. I'm not keen on occupying a full page for this.  I'm wondering if
> doing a pointer into user space, futex-style, might make more sense.
> The downside, of course, is what happens if the page being pointed to is
> swapped out.

Is it possible to implement non-sleeping versions of {get,put}_user()?
That is, use the same basic approach (let the MMU do the hard work) but use
different, and simpler, "fixup" code (return -ESOMETHING on a major
fault).

If so, I think an extra pointer (->amnesty) and an extra bit
(->amnesty_granted) in task_struct suffices:

If the scheduler runs and tsk->amnesty_granted is true, penalize the
current task (and possibly clear tsk->amnesty_granted).

Otherwise, the task is granted amnesty provided these conditions are
met:

tsk->amnesty is non-NULL
get_user_nosleep(j, tsk->amnesty) succeeds
j is now 1
put_user_nosleep(YOU_WERE_LUCKY, tsk->amnesty) succeeds

If so, set amnesty_granted and let the thread continue; otherwise reschedule.

The userspace side would be something like

void *thread_func(void*)
{
     int Im_busy = 0;
     sched_need_amnesty(&Im_busy); /* better name needed */
     /* -EPERM if not allowed (new capability?) */

     /* go critical */
     Im_busy = 1;
     LOCK();
     do_stuff();
     UNLOCK();
     if (Im_busy != 1) {
          /* better play nice with the others */
          sched_yield();
     }
     Im_busy = 0;

     /* If the thread doesn't need the amnesty feature anymore, it can just do
     sched_need_amnesty(NULL); */
}


Rasmus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 13:24   ` Rasmus Villemoes
@ 2014-03-06 13:34     ` Peter Zijlstra
  2014-03-06 13:45       ` Rasmus Villemoes
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 13:34 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
	rob, viro, oleg, venki, linux-kernel

On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> Is it possible to implement non-sleeping versions of {get,put}_user()?

__{get,put}_user()

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 13:34     ` Peter Zijlstra
@ 2014-03-06 13:45       ` Rasmus Villemoes
  2014-03-06 14:02         ` Peter Zijlstra
  2014-03-06 14:04         ` Thomas Gleixner
  0 siblings, 2 replies; 67+ messages in thread
From: Rasmus Villemoes @ 2014-03-06 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
	rob, viro, oleg, venki, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
>> Is it possible to implement non-sleeping versions of {get,put}_user()?
>
> __{get,put}_user()

Huh?

arch/x86/include/asm/uaccess.h:

/**
 * __get_user: - Get a simple variable from user space, with less checking.
 * @x:   Variable to store result.
 * @ptr: Source address, in user space.
 *
 * Context: User context only.  This function may sleep.

What am I missing?

Rasmus


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 12:13             ` Kevin Easton
@ 2014-03-06 13:59               ` Peter Zijlstra
  2014-03-06 22:41                 ` Andi Kleen
  2014-03-06 14:25               ` David Lang
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 13:59 UTC (permalink / raw)
  To: Kevin Easton
  Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, akpm, viro, oleg, linux-kernel

On Thu, Mar 06, 2014 at 11:13:33PM +1100, Kevin Easton wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > Anything else?
> 
> If it was possible to make the time remaining in the current timeslice
> available to userspace through the vdso, the thread could do something like:

Assuming we can do per-cpu values in the VDSO; this would mean hitting
that cacheline on every context switch and wakeup. That's a complete
non-starter performance wise.

> 
> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
>     yield();
> lock();
> 
> to avoid running out of timeslice in the middle of the critical section.

That can still happen; the effective slice of a single runnable task is
infinite. The moment another task gets woken this gets reduced to a
finite amount, and we then keep reducing the slice until there are about
8 runnable tasks (assuming you've not poked at any sysctls).

Also, depending on the state of the just-woken task, it might land left
of you in the tree, making it immediately eligible to run, completely
obviating whatever number you just read.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 13:45       ` Rasmus Villemoes
@ 2014-03-06 14:02         ` Peter Zijlstra
  2014-03-06 14:33           ` Thomas Gleixner
  2014-03-06 14:04         ` Thomas Gleixner
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 14:02 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
	rob, viro, oleg, venki, linux-kernel

On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> >
> > __{get,put}_user()
> 
> Huh?
> 
> arch/x86/include/asm/uaccess.h:
> 
> /**
>  * __get_user: - Get a simple variable from user space, with less checking.
>  * @x:   Variable to store result.
>  * @ptr: Source address, in user space.
>  *
>  * Context: User context only.  This function may sleep.
> 
> What am I missing?

__get_user() -> __get_user_nocheck() -> __get_user_size() -> __get_user_asm()

And __get_user_asm() seems to generate the required .fixup section for
this to work in pagefault_disable() context.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 13:45       ` Rasmus Villemoes
  2014-03-06 14:02         ` Peter Zijlstra
@ 2014-03-06 14:04         ` Thomas Gleixner
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 14:04 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Peter Zijlstra, H. Peter Anvin, Khalid Aziz, Ingo Molnar, akpm,
	andi.kleen, rob, viro, oleg, venki, linux-kernel

On Thu, 6 Mar 2014, Rasmus Villemoes wrote:

> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> >
> > __{get,put}_user()
> 
> Huh?
> 
> arch/x86/include/asm/uaccess.h:
> 
> /**
>  * __get_user: - Get a simple variable from user space, with less checking.
>  * @x:   Variable to store result.
>  * @ptr: Source address, in user space.
>  *
>  * Context: User context only.  This function may sleep.
> 
> What am I missing?

All these variants can sleep, as they might trip a page fault.....

If you want to access user space from a context which cant sleep you
need to use the __copy_to/from_user_inatomic variants. If the access
fails then they return -EFAULT and you need to deal with that
yourself.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 67+ messages in thread
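
To make the pattern Thomas describes concrete: a minimal sketch (not taken from any posted patch) of reading a 32-bit flag from a user address in a context that must not sleep. The helper name is made up, and it assumes the address was validated with access_ok() when userspace registered it.

	#include <linux/types.h>
	#include <linux/uaccess.h>

	static int read_user_flag_nosleep(u32 __user *addr, u32 *value)
	{
		unsigned long ret;

		pagefault_disable();	/* a fault now fails fast instead of sleeping */
		ret = __copy_from_user_inatomic(value, addr, sizeof(u32));
		pagefault_enable();

		/* non-zero means the page was not present; let the caller decide */
		return ret ? -EFAULT : 0;
	}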

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06  1:22                               ` Khalid Aziz
@ 2014-03-06 14:23                                 ` David Lang
  0 siblings, 0 replies; 67+ messages in thread
From: David Lang @ 2014-03-06 14:23 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel

On Wed, 5 Mar 2014, Khalid Aziz wrote:

> On 03/05/2014 05:36 PM, David Lang wrote:
>> Yes, you pay for two context switches, but you don't pay for threads
>> B..ZZZ all running (and potentially spinning) trying to acquire the lock
>> before thread A is able to complete its work.
>> 
>
> Ah, great. We are converging now.
>
>> As soon as a second thread hits the contention, thread A gets time to
>> finish.
>
> Only as long as thread A could be scheduled immediately which may or may not 
> be the case depending upon what else is running on the core thread A last ran 
> on and if thread A needs to be migrated to another core.
>
>> 
>> It's not as 'good' [1] as thread A just working longer,
>
> and that is the exact spot where I am trying to improve performance.

well, you have said that the "give me more time" flag results in 3-5% better 
performance for databases under some workloads, how does this compare with the 
results of yield_to()?

I think that the two approaches are going to be very close. I'd lay good odds 
that the difference between the two is very hard to extract from the noise of 
the variation of different runs (I won't say statistically insignificant, but I 
will say that I expect it to take a lot of statistical analysis to pull it out 
of the clutter)

David Lang

>> but it's FAR
>> better than thread A sleeping while every other thread runs and
>> potentially tries to get the lock
>
> Absolutely. I agree with that.
>
>> 
>> [1] it wastes the context switches, but it avoids the overhead of
>> figuring out if the thread needs to extend its time, and if its time
>> was actually extended, and what penalty it should suffer the next time
>> it runs....
>
> All of it can be done by setting and checking couple of flags in task_struct. 
> That is not insignificant, but hardly expensive. Logic is quite simple:
>
> resched()
> {
> 	........
> 	if (immunity) {
> 		if (!penalty) {
> 			immunity = 0;
> 			penalty = 1;
> 			-- skip context switch --
> 		}
> 		else {
> 			immunity = penalty = 0;
> 			-- do the context switch --
> 		}
> 	}
> 	.........
> }
>
> sched_yield()
> {
> 	......
> 	penalty = 0;
> 	......
> }
>
> This simple logic will also work to defeat the obnoxious threads that keep 
> setting immunity request flag repeatedly within the same critical section to 
> give themselves multiple extensions.
>
> Thanks,
> Khalid
>

^ permalink raw reply	[flat|nested] 67+ messages in thread
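
Spelled out a little more fully, the handshake Khalid sketches above could look like the following. This is only an illustration of the intended control flow, not code from the patch; the flags are modeled as a standalone struct so the sketch stands on its own.

	struct delay_state {
		unsigned char immunity;	/* set by the task before its critical section */
		unsigned char penalty;	/* set by the scheduler when it grants an extension */
	};

	/* Returns 1 if the context switch should be skipped this once. */
	static int grant_extension(struct delay_state *d)
	{
		if (d->immunity) {
			if (!d->penalty) {
				d->immunity = 0;
				d->penalty = 1;	/* cleared by sched_yield() if the task behaves */
				return 1;	/* skip the context switch once */
			}
			/* did not yield after the last extension: no second grant */
			d->immunity = 0;
			d->penalty = 0;
		}
		return 0;			/* preempt as usual */
	}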

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 12:13             ` Kevin Easton
  2014-03-06 13:59               ` Peter Zijlstra
@ 2014-03-06 14:25               ` David Lang
  2014-03-06 16:12                 ` Khalid Aziz
  1 sibling, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-06 14:25 UTC (permalink / raw)
  To: Kevin Easton
  Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
	H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, oleg,
	linux-kernel

On Thu, 6 Mar 2014, Kevin Easton wrote:

> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> If it was possible to make the time remaining in the current timeslice
> available to userspace through the vdso, the thread could do something like:
>
> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
>    yield();
> lock();
>
> to avoid running out of timeslice in the middle of the critical section.

but won't the system call result in context switches? According to Kevin, even a 
context switch to another thread and back immediately is bad enough to need to be 
avoided, so replacing that with the context switch to the kernel and back isn't 
a substantial win.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 14:02         ` Peter Zijlstra
@ 2014-03-06 14:33           ` Thomas Gleixner
  2014-03-06 14:34             ` H. Peter Anvin
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rasmus Villemoes, H. Peter Anvin, Khalid Aziz, Ingo Molnar, akpm,
	andi.kleen, rob, viro, oleg, venki, linux-kernel



On Thu, 6 Mar 2014, Peter Zijlstra wrote:

> On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
> > Peter Zijlstra <peterz@infradead.org> writes:
> > 
> > > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> > >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> > >
> > > __{get,put}_user()
> > 
> > Huh?
> > 
> > arch/x86/include/asm/uaccess.h:
> > 
> > /**
> >  * __get_user: - Get a simple variable from user space, with less checking.
> >  * @x:   Variable to store result.
> >  * @ptr: Source address, in user space.
> >  *
> >  * Context: User context only.  This function may sleep.
> > 
> > What am I missing?
> 
> __get_user() -> __get_user_nocheck() -> __get_user_size() -> __get_user_asm()
> 
> And __get_user_asm() seems to generate the required .fixup section for
> this to work in pagefault_disable() context.

Well, it still might sleep if you're using it in page fault enabled
context. So the documentation of that function sucks.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 14:33           ` Thomas Gleixner
@ 2014-03-06 14:34             ` H. Peter Anvin
  0 siblings, 0 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-06 14:34 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Rasmus Villemoes, Khalid Aziz, Ingo Molnar, akpm, andi.kleen,
	rob, viro, oleg, venki, linux-kernel

The no checking is omitting access_ok(), no? Either way, disabling page faults has to be done explicitly.

On March 6, 2014 6:33:04 AM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
>On Thu, 6 Mar 2014, Peter Zijlstra wrote:
>
>> On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
>> > Peter Zijlstra <peterz@infradead.org> writes:
>> > 
>> > > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
>> > >> Is it possible to implement non-sleeping versions of
>{get,put}_user()?
>> > >
>> > > __{get,put}_user()
>> > 
>> > Huh?
>> > 
>> > arch/x86/include/asm/uaccess.h:
>> > 
>> > /**
>> >  * __get_user: - Get a simple variable from user space, with less
>checking.
>> >  * @x:   Variable to store result.
>> >  * @ptr: Source address, in user space.
>> >  *
>> >  * Context: User context only.  This function may sleep.
>> > 
>> > What am I missing?
>> 
>> __get_user() -> __get_user_nocheck() -> __get_user_size() ->
>__get_user_asm()
>> 
>> And __get_user_asm() seems to generate the required .fixup section
>for
>> this to work in pagefault_disable() context.
>
>Well, it still might sleep if you're using it in page fault enabled
>context. So the documentation of that function sucks.
>
>Thanks,
>
>	tglx

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06  9:57                 ` Peter Zijlstra
@ 2014-03-06 16:08                   ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, akpm, viro, oleg, linux-kernel

On 03/06/2014 02:57 AM, Peter Zijlstra wrote:
> On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
>> Looking at the current problem I am trying to
>> solve with databases and JVM, I run into the same issue I described in my
>> earlier email. Proxy execution is a post-contention solution. By the time
>> proxy execution can do something for my case, I have already paid the price
>> of contention and a context switch which is what I am trying to avoid. For a
>> critical section that is very short compared to the size of execution
>> thread, which is the case I am looking at, avoiding preemption in the middle
>> of that short critical section helps much more than dealing with lock
>> contention later on.
>
> Like others have already stated; its likely still cheaper than the
> pile-up you get now. It might not be optimally fast, but it sure takes
> out the worst case you have now.
>
>> The goal here is to avoid lock contention and
>> associated cost. I do understand the cost of dealing with lock contention
>> poorly and that can easily be much bigger cost, but I am looking into
>> avoiding even getting there.
>
> The thing is; unless userspace is a RT program or practises the same
> discipline in such an extend as that it make no practical difference,
> there's always going to be the case where you fail to cover the entire
> critical section, at which point you're back to your pile-up fail.
>
> So while the limited preemption guard helps the best cast, it doesn't
> help the worst case at all.

That is true. I am breaking this problem into two parts - (1) avoid pile 
up, (2) if pile up happens, deal with it efficiently. Worst case 
scenario you point out is the second part of the problem. Solutions for 
that can be PTHREAD_PRIO_PROTECT protocol for the threads that use POSIX 
threads or proxy execution. Once a pile up has happened, the cost of a system 
call to boost thread priority becomes a much smaller part of the overall cost 
of handling the pile up.

Part (1) of this problem is what my patch attempts to solve. Here the 
cost of system call to boost priority or do anything else is too high. 
The mechanism to avoid pile up has to be very light weight to be of any use.

>
> So supposing we went with this now; you (or someone else) will come back
> in a year's time and tell us that if we only just stretch this window a
> little, their favourite workload will also benefit.
>
> Where's the end of that?
>
> And what about CONFIG_HZ; suppose you compile your kernel with HZ=100
> and your 1 extra tick is sufficient. Then someone compiles their kernel
> with HZ=1000 and it all comes apart.
>
>

My goal here is to help the cases where critical section is short and 
executes quickly as it should be for well designed critical sections in 
threads that want to run using CFS. I see this as an incremental 
improvement over current situation. With CFS, timeslice is adaptive and 
depends upon the workload, so it is not directly tied to CONFIG_HZ. But 
you are right, CONFIG_HZ does have a bearing on this. I see a critical 
section that can easily go over a single timeslice and cause a pile up, 
as a workload designed to create these problems. Such a workload needs 
to use SCHED_FIFO or the deadline scheduler with properly designed yield 
points and priorities, or live with the pile ups caused by using CFS. 
Trying to help such cases with CFS is not beneficial and will cause CFS 
to become more and more complex. What I am trying to do is help the 
cases where a short critical section ends up being pre-empted simply 
because the execution reached critical section only towards the end of 
current timeslice and resulted in an unintended pile up. So give these 
cases a tool to avoid pile ups but use of the tool comes with 
restrictions (yield the processor as soon as you can if you got amnesty, 
and pay a penalty if you don't). At this point, the two workloads I know 
of that fit this group are databases and JVM both of which are in 
significant use.

Makes sense?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread
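
For reference, a minimal userspace sketch of the PTHREAD_PRIO_PROTECT setup mentioned above for the post-contention case; the ceiling value is left to the caller and error handling is reduced to returning the first failure.

	#include <pthread.h>

	/* Illustration only: a mutex using the POSIX priority ceiling protocol. */
	static pthread_mutex_t db_lock;

	static int init_prio_protect_lock(int ceiling)
	{
		pthread_mutexattr_t attr;
		int err;

		err = pthread_mutexattr_init(&attr);
		if (!err)
			err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
		if (!err)
			/* ceiling must be a valid real-time priority on this system */
			err = pthread_mutexattr_setprioceiling(&attr, ceiling);
		if (!err)
			err = pthread_mutex_init(&db_lock, &attr);
		pthread_mutexattr_destroy(&attr);
		return err;
	}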

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 14:25               ` David Lang
@ 2014-03-06 16:12                 ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:12 UTC (permalink / raw)
  To: David Lang, Kevin Easton
  Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, peterz, akpm, viro, oleg, linux-kernel

On 03/06/2014 07:25 AM, David Lang wrote:
> On Thu, 6 Mar 2014, Kevin Easton wrote:
>
>> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>>> Anything else?
>>
>> If it was possible to make the time remaining in the current timeslice
>> available to userspace through the vdso, the thread could do something
>> like:
>>
>> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
>>    yield();
>> lock();
>>
>> to avoid running out of timeslice in the middle of the critical section.
>
> but won't the system call result in context switches? According to
> Kevin, even a context switch to another thread and back immediately is
> bad enough to need to be avoided, so replacing that with the context
> switch to the kernel and back isn't a substantial win.
>
> David Lang

Using vdso reduces the cost of a system call significantly, but as Peter 
pointed out a thread cannot really rely upon the number it will get back.

--
Khalid

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 11:14                 ` Thomas Gleixner
@ 2014-03-06 16:32                   ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andi Kleen, One Thousand Gnomes, H. Peter Anvin,
	Ingo Molnar, Andrew Morton, Al Viro, Oleg Nesterov, LKML

On 03/06/2014 04:14 AM, Thomas Gleixner wrote:
> We understand that you want to avoid preemption in the first place and
> not getting into the contention handling case.
>
> But, what you're trying to do is essentially creating an ABI which we
> have to support and maintain forever. And that definitely is worth a
> few serious questions.

Fair enough. I agree a new ABI should not be created lightly.

>
> Lets ignore the mm related issues for now as those can be solved. That's
> the least of my worries.
>
> Right now you are using this for a single use case with a well defined
> environment, where all related threads reside in the same scheduling
> class (FAIR). But that's one of a gazillion of use cases of Linux.
>

Creating a new ABI for a single use case or a special case is something 
I would argue against as well. I am with you on that. I am stating that 
databases and JVM happen to be two real world examples of the scenario 
where CFS can cause convoying problem inadvertently for a well designed 
critical section that represents a small portion of overall execution 
thread, simply because of where in the current timeslice the critical 
section is hit. If there are other examples others have come across, I 
would love to hear it. If we can indeed say this is a very special case 
for an uncommon workload, I would completely agree with refusing to 
create a new ABI.

> If we allow you to special case your database workload then we have no
> argument why we should not do the same thing for realtime workloads
> where the SCHED_FAIR housekeeping thread can hold a lock shortly to
> access some important data in the SCHED_FIFO realtime computation
> thread. Of course the RT people want to avoid the lock contention as
> much as you do, just for different reasons.
>
> Add SCHED_EDF, cgroups and hierarchical scheduling to the picture and
> hell breaks lose.

Realtime and deadline scheduler policies are supposed to be higher 
priority than CFS. A thread running in CFS that can impact threads 
running with realtime policies is a bad thing, agreed? What I am 
proposing actually allows a thread running with CFS to get out of the 
way of threads running with realtime policies quicker. In your specific 
example, the SCHED_FAIR housekeeping thread gets a chance to get out of 
SCHED_FIFO threads' way by giving its critical section a better chance to 
complete execution before causing a convoy problem and while its cache 
is hot by using the exact same mechanism I am proposing. The logic is 
not onerous. The thread asks for amnesty from one context switch if and only 
if a rescheduling point happens in the middle of its timeslice. If a 
rescheduling point does not occur during its critical section, the 
thread takes that request back and life goes on as if nothing changed. 
If rescheduling point happens in the middle of thread's critical 
section, it gets the amnesty but it yields the processor as soon as it 
is done with its critical section. Any thread that does not play nice 
gets penalized next time it wants immunity (as hpa suggested).

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-06 13:59               ` Peter Zijlstra
@ 2014-03-06 22:41                 ` Andi Kleen
  0 siblings, 0 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-06 22:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kevin Easton, Andi Kleen, Thomas Gleixner, Khalid Aziz,
	One Thousand Gnomes, H. Peter Anvin, Ingo Molnar, akpm, viro,
	oleg, linux-kernel

On Thu, Mar 06, 2014 at 02:59:46PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 06, 2014 at 11:13:33PM +1100, Kevin Easton wrote:
> > On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > > Anything else?
> > 
> > If it was possible to make the time remaining in the current timeslice
> > available to userspace through the vdso, the thread could do something like:
> 
> Assuming we can do per-cpu values in the VDSO; this would mean hitting
> that cacheline on every context switch and wakeup. That's a complete
> non-starter performance wise.

If you worry about fetching it you can always prefetch it early.

> > if (sys_timeleft() < CRITICAL_SECTION_SIZE)
> >     yield();
> > lock();
> > 
> > to avoid running out of timeslice in the middle of the critical section.
> 
> Can still happen, the effective slice of a single runnable task is
> infinite, the moment another task gets woken this gets reduced to a finite
> amount, we then keep reducing the slice until there are about 8 runnable
> tasks (assuming you've not poked at any sysctls).

I guess it could be some predicted value, similar to how the menu
governor works.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v2] Pre-emption control for userspace
  2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
                   ` (2 preceding siblings ...)
  2014-03-04 21:12 ` H. Peter Anvin
@ 2014-03-25 17:17 ` Khalid Aziz
  2014-03-25 17:44   ` Andrew Morton
                     ` (3 more replies)
  2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
  4 siblings, 4 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:17 UTC (permalink / raw)
  To: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm
  Cc: Khalid Aziz, linux-kernel, linux-doc


This patch adds a way for a thread to request additional timeslice from
the scheduler if it is about to be preempted, so it could complete any
critical task it is in the middle of. This functionality helps with
performance on databases and has been used for many years on other OSs
by the databases. This functionality helps in situation where a thread
acquires a lock before performing a critical operation on the database,
happens to get preempted before it completes its task and releases the
lock.  This lock causes all other threads that also acquire the same
lock to perform their critical operation on the database to start
queueing up and causing large number of context switches. This queueing
problem can be avoided if the thread that acquires lock first could
request scheduler to grant it an additional timeslice once it enters its
critical section and hence allow it to complete its critical section
without causing a queueing problem. If the critical section completes before
the thread is due for preemption, the thread can simply deassert its
request. A thread sends the scheduler this request by setting a flag in
a memory location it has shared with the kernel.  Kernel uses bytes in
the same memory location to let the thread know when its request for
amnesty from preemption has been granted. Thread should yield the
processor at the end of its critical section if it was granted amnesty
to play nice with other threads. If thread fails to yield processor, it
gets penalized by having its next amnesty request turned down by
scheduler.  Documentation file included in this patch contains further
details on how to use this functionality and conditions associated with
its use. This patch also adds a new field in scheduler statistics which
keeps track of how many times was a thread granted amnesty from
preemption. This feature and its usage are documented in
Documentation/scheduler/sched-preempt-delay.txt and this patch includes
a test for this feature under tools/testing/selftests/preempt-delay

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v2:
	- Replaced mmap operation with a more memory efficient futex
	  like communication between userspace and kernel
	- Added a flag to let userspace know if it was granted amnesty
	- Added a penalty for tasks failing to yield CPU when they
	  are granted amnesty from pre-emption

v1:
	- Initial RFC patch with mmap for communication between userspace
	  and kernel

 Documentation/scheduler/sched-preempt-delay.txt    | 121 ++++++++++
 arch/x86/Kconfig                                   |  12 +
 fs/proc/base.c                                     |  89 ++++++++
 include/linux/sched.h                              |  15 ++
 kernel/fork.c                                      |   5 +
 kernel/sched/core.c                                |   8 +
 kernel/sched/debug.c                               |   1 +
 kernel/sched/fair.c                                | 114 ++++++++-
 tools/testing/selftests/preempt-delay/Makefile     |   8 +
 .../selftests/preempt-delay/preempt-delay.c        | 254 +++++++++++++++++++++
 10 files changed, 624 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt
 create mode 100644 tools/testing/selftests/preempt-delay/Makefile
 create mode 100644 tools/testing/selftests/preempt-delay/preempt-delay.c

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..38b4edc
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,121 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, scheduler ends up scheduling other
+threads which immediately try to grab the lock to enter the critical
+section. This only results in lots of context switches as tasks wake up
+and go to sleep immediately again. If on the other hand, the original
+task were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is enabled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. It first gives the kernel
+address for this memory location by writing its address to
+/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
+interpreted as a sequence of 4 bytes:
+
+	byte[0] = flag to request preemption delay
+	byte[1] = flag from kernel indicating preemption delay was granted
+	byte[2] = reserved for future use
+	byte[3] = reserved for future use
+
+Task requests a preemption delay by writing a non-zero value to the
+first byte. Scheduler checks this value before preempting the task.
+Scheduler can choose to grant one and only one additional time slice to
+the task for each delay request but this delay is not guaranteed.
+If scheduler does grant an additional timeslice, it will set the flag
+in second byte. Upon completion of the section of code where the task
+wants preemption delay, task should check the second byte. If the flag
+in second byte is set, it should clear this flag and call sched_yield()
+so as to not hog the processor. If a thread was granted additional
+timeslice and it fails to call sched_yield(), scheduler will penalize
+it by denying its next request for additional timeslice. Following sample
+code illustrates the use:
+
+int main()
+{
+	int fd, fsz;
+	unsigned char buf[256];
+	unsigned char preempt_delay[4];
+
+	sprintf(buf, "/proc/%lu/task/%lu/sched_preempt_delay", getpid(),
+							syscall(SYS_gettid));
+	fd = open(buf, O_RDWR);
+
+	preempt_delay[0] = preempt_delay[1] = 0;
+
+	/* Tell kernel where the flag lives */
+	*(unsigned int **)buf = (unsigned int *)preempt_delay;
+	write(fd, buf, sizeof(unsigned int *));
+
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		preempt_delay[0] = 1;
+		/*
+		 * Obtain lock for critical section
+		 */
+		/*
+		 * critical section
+		 */
+		/*
+		 * Release lock for critical section
+		 */
+		preempt_delay[0] = 0;
+		/* Give the CPU up if required */
+		if (preempt_delay[1]) {
+			preempt_delay[1] = 0;
+			sched_yield();
+		}
+		/* do some more work */
+	}
+	/*
+	 * Tell kernel we are done asking for preemption delay
+	 */
+	*(unsigned int **)buf = 0;
+	write(fd, buf, sizeof(unsigned int *));
+	close(fd);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+The preemption delay feature adds a new field to scheduler statistics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. /proc/<tgid>/task/<tid>/sched_preempt_delay can be written to only
+   by the thread that corresponds to this file.
+
+2. /proc/<tgid>/task/<tid>/sched_preempt_delay can be written with a valid
+   memory address only once. To write a new memory address, the previous
+   memory address must be cleared first by writing NULL. Each new
+   memory address requires validation in the kernel and update of
+   pointers. Changing this address too many times creates too much
+   overhead.
+
+3. Reading /proc/<tgid>/task/<tid>/sched_preempt_delay returns the
+   current memory location address thread is using to communicate with
+   the kernel.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..2d54816 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -849,6 +849,18 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	depends on PROC_FS && PREEMPT_NOTIFIERS
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by setting a flag in a memory location after
+	  sharing the address of this location by writing to
+	  /proc/<tgid>/task/<tid>/sched_preempt_delay. See
+	  Documentation/scheduler/sched-preempt-delay.txt for details.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b976062..f6ab240 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1304,6 +1304,92 @@ static const struct file_operations proc_pid_sched_operations = {
 
 #endif
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static int
+tid_preempt_delay_show(struct seq_file *m, void *v)
+{
+	struct inode *inode = m->private;
+	struct task_struct *task = get_proc_task(inode);
+	unsigned char *delay_req;
+
+	if (!task)
+		return -ENOENT;
+
+	delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
+	seq_printf(m, "0x%-p\n", delay_req);
+
+	put_task_struct(task);
+	return 0;
+}
+
+static ssize_t
+tid_preempt_delay_write(struct file *file, const char __user *buf,
+			  size_t count, loff_t *offset)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task = get_proc_task(inode);
+	u32 __user *delay_req;
+	int retval;
+
+	if (!task) {
+		retval = -ENOENT;
+		goto out;
+	}
+
+	/*
+	 * A thread can write only to its corresponding preempt_delay
+	 * proc file
+	 */
+	if (current != task) {
+		retval =  -EPERM;
+		goto out;
+	}
+
+	delay_req = *(u32 __user **)buf;
+
+	/*
+	 * Do not allow write if pointer is currently set
+	 */
+	if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Validate the pointer.
+	 */
+	if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	task->sched_preempt_delay.delay_req = delay_req;
+
+	/* zero out flags */
+	put_user(0, delay_req);
+
+	retval = count;
+
+out:
+	put_task_struct(task);
+	return retval;
+}
+
+static int
+tid_preempt_delay_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, tid_preempt_delay_show, inode);
+}
+
+static const struct file_operations proc_tid_preempt_delay_ops = {
+	.open		= tid_preempt_delay_open,
+	.read		= seq_read,
+	.write		= tid_preempt_delay_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 /*
  * Print out autogroup related information:
@@ -2999,6 +3085,9 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	REG("sched_preempt_delay", S_IRUGO|S_IWUSR, proc_tid_preempt_delay_ops),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..77aba5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1250,6 +1251,13 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay {
+		u32 __user *delay_req;		/* delay request flag pointer */
+		unsigned char delay_granted:1;	/* currently in delay */
+		unsigned char yield_penalty:1;	/* failure to yield penalty */
+	} sched_preempt_delay;
+#endif
 
 	pid_t pid;
 	pid_t tgid;
@@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+					struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+					unsigned char *val);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..8847176 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1617,6 +1617,11 @@ long do_fork(unsigned long clone_flags,
 			init_completion(&vfork);
 			get_task_struct(p);
 		}
+#if CONFIG_SCHED_PREEMPT_DELAY
+		p->sched_preempt_delay.delay_req = NULL;
+		p->sched_preempt_delay.delay_granted = 0;
+		p->sched_preempt_delay.yield_penalty = 0;
+#endif
 
 		wake_up_new_task(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635..ec16b4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
 {
 	struct rq *rq = this_rq_lock();
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	/*
+	 * Clear the penalty flag for current task to reward it for
+	 * palying by the rules
+	 */
+	current->sched_preempt_delay.yield_penalty = 0;
+#endif
+
 	schedstat_inc(rq, yld_count);
 	current->sched_class->yield_task(rq);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..2abd02b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -602,6 +602,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b4c4f3..142bed5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -444,6 +444,114 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_task(): Check if the task about to be preempted has
+ *	requested an additional time slice. If it has, grant it additional
+ *	timeslice once.
+ */
+static void
+delay_resched_task(struct task_struct *curr)
+{
+	struct sched_entity *se;
+	int cpu = task_cpu(curr);
+	u32 __user *delay_req;
+	unsigned int delay_req_flag;
+	unsigned char *delay_flag;
+
+	/*
+	 * Check if task is using pre-emption delay feature. If address
+	 * for preemption delay request flag is not set, this task is
+	 * not using preemption delay feature, we can reschedule without
+	 * any delay
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+
+	if ((delay_req == NULL) || (cpu != smp_processor_id()))
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will  be granted only once. If this task
+	 * has already been granted delay, reschedule now
+	 */
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		goto resched_now;
+	}
+
+	/*
+	 * Get the value of preemption delay request flag from userspace.
+	 * Task had already passed us the address where the flag is stored
+	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
+	 * futex, leverage the futex code here to read the flag. If there
+	 * is a page fault accessing this flag in userspace, that means
+	 * userspace has not touched this flag recently and we can
+	 * assume no preemption delay is needed.
+	 *
+	 * If task is not requesting additional timeslice, resched now
+	 */
+	if (delay_req) {
+		int ret;
+
+		pagefault_disable();
+		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
+				sizeof(u32));
+		pagefault_enable();
+		delay_flag = &delay_req_flag;
+		if (ret || !delay_flag[0])
+			goto resched_now;
+	} else {
+		goto resched_now;
+	}
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. If this thread failed to yield
+	 * processor after being granted amnesty last time, penalize it
+	 * by not granting this delay request, otherwise give it an extra
+	 * timeslice.
+	 */
+	if (curr->sched_preempt_delay.yield_penalty) {
+		curr->sched_preempt_delay.yield_penalty = 0;
+		goto resched_now;
+	}
+
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+
+	/*
+	 * Set the penalty flag for failing to yield the processor after
+	 * being granted immunity. This flag will be cleared in
+	 * sched_yield() if the thread indeed calls sched_yield
+	 */
+	curr->sched_preempt_delay.yield_penalty = 1;
+
+	/*
+	 * Let the thread know it got amnesty and it should call
+	 * sched_yield() when it is done to avoid penalty next time
+	 * it wants amnesty. We need to write to userspace location.
+	 * Since we just read from this location, chances are extremely
+	 * low we might page fault. If we do page fault, we will ignore
+	 * it and accept the cost of failed write in form of unnecessary
+	 * penalty for userspace task for not yielding processor.
+	 * This is a highly unlikely scenario.
+	 */
+	delay_flag[0] = 0;
+	delay_flag[1] = 1;
+	pagefault_disable();
+	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
+	pagefault_enable();
+
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	return;
+
+resched_now:
+	resched_task(curr);
+}
+#else
+#define delay_resched_task(curr) resched_task(curr)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2679,7 +2787,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_task(rq_of(cfs_rq)->curr);
+		delay_resched_task(rq_of(cfs_rq)->curr);
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2703,7 +2811,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_task(rq_of(cfs_rq)->curr);
+		delay_resched_task(rq_of(cfs_rq)->curr);
 }
 
 static void
@@ -4477,7 +4585,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_task(curr);
+	delay_resched_task(curr);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/tools/testing/selftests/preempt-delay/Makefile b/tools/testing/selftests/preempt-delay/Makefile
new file mode 100644
index 0000000..b2da185
--- /dev/null
+++ b/tools/testing/selftests/preempt-delay/Makefile
@@ -0,0 +1,8 @@
+all:
+	gcc -pthread preempt-delay.c -o preempt-delay -lrt
+
+run_tests: all
+	./preempt-delay 300 400
+
+clean:
+	rm -f ./preempt-delay
diff --git a/tools/testing/selftests/preempt-delay/preempt-delay.c b/tools/testing/selftests/preempt-delay/preempt-delay.c
new file mode 100644
index 0000000..59daf8f
--- /dev/null
+++ b/tools/testing/selftests/preempt-delay/preempt-delay.c
@@ -0,0 +1,254 @@
+/*
+ * This test program checks for the presence of preemption delay feature
+ * in the kernel. If the feature is present, it exercises it by running
+ * a number of threads that ask for preemption delay and checks if they
+ * are granted these preemption delays. It then runs the threads again
+ * without requesting preemption delays and verifies preemption delays
+ * are not granted when not requested (negative test).
+ */
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+#define NUMTHREADS	1000
+
+pthread_mutex_t		mylock = PTHREAD_MUTEX_INITIALIZER;
+unsigned long		iterations;
+unsigned long		delays_granted = 0;
+unsigned long		request_delay = 1;
+
+#define BUFSIZE		1024
+
+int
+feature_check()
+{
+	unsigned char buf[BUFSIZE];
+
+	sprintf(buf, "/proc/%d/task/%ld/sched_preempt_delay",
+					getpid(), syscall(SYS_gettid));
+	if (access(buf, F_OK))
+		return 1;
+	return 0;
+}
+
+void
+do_some_work(void *param)
+{
+	struct timespec timeout;
+	int i, j, tid, fd, fsz;
+	unsigned long sum;
+	unsigned char buf[BUFSIZE];
+	unsigned char delay[4];
+	int cnt = 0;
+
+	/*
+	 * mmap the sched_preempt_delay file
+	 */
+	sprintf(buf, "/proc/%d/task/%ld/sched_preempt_delay",
+					getpid(), syscall(SYS_gettid));
+	fd = open(buf, O_RDWR);
+	if (fd == -1) {
+		perror("Error opening sched_preemp_delay file");
+		return;
+	}
+
+	for (i = 0; i < 4; i++)
+		delay[i] = 0;
+
+	if (request_delay) {
+		*(unsigned int **)buf = (unsigned int *) &delay;
+		if (write(fd, buf, sizeof(unsigned int *)) < 0) {
+			perror("Error writing flag address");
+			close(fd);
+			return;
+		}
+	}
+
+	tid = *(int *) param;
+
+	for (i = 0; i < iterations; i++) {
+		/* start by locking the resource */
+		if (request_delay)
+			delay[0] = 1;
+		if (pthread_mutex_lock(&mylock)) {
+			perror("mutex_lock():");
+			delay[0] = 0;
+			return;
+		}
+
+		/* Do some busy work */
+		sum = 0;
+		for (j = 0; j < (iterations*(tid+1)); j++)
+			sum += sum;
+		for (j = 0; j < iterations/(tid+1); j++)
+			sum += i^2;
+
+		/* Now unlock the resource */
+		if (pthread_mutex_unlock(&mylock)) {
+			perror("mutex_unlock():");
+			delay[0] = 0;
+			return;
+		}
+		delay[0] = 0;
+
+		if (delay[1]) {
+			delay[1] = 0;
+			cnt++;
+			sched_yield();
+		}
+	}
+
+	if (request_delay) {
+		*(unsigned int **)buf = 0;
+		if (write(fd, buf, sizeof(unsigned int *)) < 0) {
+			perror("Error clearing flag address");
+			close(fd);
+			return;
+		}
+	}
+	close(fd);
+
+	/*
+	 * Update global count of delays granted. Need to grab a lock
+	 * since this is a global.
+	 */
+	if (pthread_mutex_lock(&mylock)) {
+		perror("mutex_lock():");
+		delay[0] = 0;
+		return;
+	}
+	delays_granted += cnt;
+	if (pthread_mutex_unlock(&mylock)) {
+		perror("mutex_unlock():");
+		delay[0] = 0;
+		return;
+	}
+}
+
+void
+help(char *progname)
+{
+	fprintf(stderr, "Usage: %s <number of threads> ", progname);
+	fprintf(stderr, "<number of iterations>\n");
+	fprintf(stderr, "   Notes: (1) Maximum number of threads is %d\n",
+								NUMTHREADS);
+	fprintf(stderr, "          (2) Suggested number of iterations is ");
+	fprintf(stderr, "300-10000\n");
+	fprintf(stderr, "          (3) Exit codes are: 1 = Failed with no ");
+	fprintf(stderr, "preemption delays granted\n");
+	fprintf(stderr, "                              2 = Failed with ");
+	fprintf(stderr, "preemption delays granted when\n");
+	fprintf(stderr, "                                  not requested\n");
+	fprintf(stderr, "                              3 = Error in test ");
+	fprintf(stderr, "arguments\n");
+	fprintf(stderr, "                              4 = Other errors\n");
+}
+
+int main(int argc, char **argv)
+{
+	pthread_t	thread[NUMTHREADS];
+	int		ret, i, tid[NUMTHREADS];
+	unsigned long	nthreads;
+
+	/* check arguments */
+	if (argc < 3) {
+		help(argv[0]);
+		exit(3);
+	}
+
+	nthreads = atoi(argv[1]);
+	iterations = atoi(argv[2]);
+	if (nthreads > NUMTHREADS) {
+		fprintf(stderr, "ERROR: exceeded maximum number of threads\n");
+		exit(3);
+	}
+
+	/*
+	 * Check for the presence of feature
+	 */
+	if (feature_check()) {
+		printf("INFO: Pre-emption delay feature is not present in ");
+		printf("this kernel\n");
+		exit(0);
+	}
+
+	/*
+	 * Create a bunch of threads that will compete for the
+	 * same mutex. Run these threads first while requesting
+	 * preemption delay.
+	 */
+	for (i = 0; i < nthreads; i++) {
+		tid[i] = i;
+		ret = pthread_create(&thread[i], NULL, (void *)&do_some_work,
+				&tid[i]);
+		if (ret) {
+			perror("pthread_create(): ");
+			exit(4);
+		}
+	}
+
+	printf("Threads started. Waiting......\n");
+	/* Now wait for threads to get done */
+	for (i = 0; i < nthreads; i++) {
+		ret = pthread_join(thread[i], NULL);
+		if (ret) {
+			perror("pthread_join(): ");
+			exit(4);
+		}
+	}
+
+	/*
+	 * We started out with requesting pre-emption delays, check if
+	 * we got at least a few.
+	 */
+	if (delays_granted == 0) {
+		fprintf(stderr, "FAIL: No delays granted at all.\n");
+		exit(1);
+	}
+
+	/*
+	 * Run the threads again, this time not requesting preemption delays
+	 */
+	request_delay = 0;
+	delays_granted = 0;
+	for (i = 0; i < nthreads; i++) {
+		tid[i] = i;
+		ret = pthread_create(&thread[i], NULL, (void *)&do_some_work,
+				&tid[i]);
+		if (ret) {
+			perror("pthread_create(): ");
+			exit(4);
+		}
+	}
+
+	printf("Threads started. Waiting......\n");
+	/* Now wait for threads to get done */
+	for (i = 0; i < nthreads; i++) {
+		ret = pthread_join(thread[i], NULL);
+		if (ret) {
+			perror("pthread_join(): ");
+			exit(4);
+		}
+	}
+
+	/*
+	 * Check if preemption delays were granted even though we
+	 * did not ask for them
+	 */
+	if (delays_granted > 0) {
+		fprintf(stderr, "FAIL: delays granted when not requested.\n");
+		exit(2);
+	}
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
@ 2014-03-25 17:44   ` Andrew Morton
  2014-03-25 17:56     ` Khalid Aziz
  2014-03-25 17:46   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2014-03-25 17:44 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
	riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On Tue, 25 Mar 2014 11:17:50 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> 
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock.  This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel.  Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler.  Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay

What a long paragraph ;)

The feature makes sense and sounds useful, but I'd like to see some
real world performance testing results so we can understand what the
benefit will be to our users?

>
> ...
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +static int
> +tid_preempt_delay_show(struct seq_file *m, void *v)
> +{
> +	struct inode *inode = m->private;
> +	struct task_struct *task = get_proc_task(inode);
> +	unsigned char *delay_req;
> +
> +	if (!task)
> +		return -ENOENT;
> +
> +	delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
> +	seq_printf(m, "0x%-p\n", delay_req);
> +
> +	put_task_struct(task);
> +	return 0;
> +}
> +
> +static ssize_t
> +tid_preempt_delay_write(struct file *file, const char __user *buf,
> +			  size_t count, loff_t *offset)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct task_struct *task = get_proc_task(inode);
> +	u32 __user *delay_req;
> +	int retval;
> +
> +	if (!task) {
> +		retval = -ENOENT;
> +		goto out;
> +	}
> +
> +	/*
> +	 * A thread can write only to its corresponding preempt_delay
> +	 * proc file
> +	 */
> +	if (current != task) {
> +		retval =  -EPERM;
> +		goto out;
> +	}
> +
> +	delay_req = *(u32 __user **)buf;
> +
> +	/*
> +	 * Do not allow write if pointer is currently set
> +	 */
> +	if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
> +		retval = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Validate the pointer.
> +	 */
> +	if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
> +		retval = -EFAULT;
> +		goto out;
> +	}
> +
> +	task->sched_preempt_delay.delay_req = delay_req;
> +
> +	/* zero out flags */
> +	put_user(0, delay_req);
> +
> +	retval = count;
> +
> +out:
> +	put_task_struct(task);
> +	return retval;
> +}

So the procfs file is written in binary format and is read back in
ascii format.  Seems odd.

Perhaps this should all be done as a new syscall rather than some
procfs thing.

>
> ...
>
> @@ -1250,6 +1251,13 @@ struct task_struct {
>  	/* Revert to default priority/policy when forking */
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	struct preempt_delay {
> +		u32 __user *delay_req;		/* delay request flag pointer */
> +		unsigned char delay_granted:1;	/* currently in delay */
> +		unsigned char yield_penalty:1;	/* failure to yield penalty */
> +	} sched_preempt_delay;

The problem with bitfields is that a write to one bitfield can corrupt
a concurrent write to the other one.  So it's your responsibility to
provide locking and/or to describe how this race is avoided.  A comment
here in the definition would be a suitable way of addressing this.

> +#endif
>  
>  	pid_t pid;
>  	pid_t tgid;
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
>  {
>  	struct rq *rq = this_rq_lock();
>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules

"playing"

> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
>  	schedstat_inc(rq, yld_count);
>  	current->sched_class->yield_task(rq);
>  
> ...
>
> +static void
> +delay_resched_task(struct task_struct *curr)
> +{
> +	struct sched_entity *se;
> +	int cpu = task_cpu(curr);
> +	u32 __user *delay_req;
> +	unsigned int delay_req_flag;
> +	unsigned char *delay_flag;
> +
> +	/*
> +	 * Check if task is using pre-emption delay feature. If address
> +	 * for preemption delay request flag is not set, this task is
> +	 * not using preemption delay feature, we can reschedule without
> +	 * any delay
> +	 */
> +	delay_req = curr->sched_preempt_delay.delay_req;
> +
> +	if ((delay_req == NULL) || (cpu != smp_processor_id()))

Presumably we don't get "smp_processor_id() used in preemptible code"
warnings here, so called-with-preempt-disabled is a secret prerequisite
for delay_resched_task().

> +		goto resched_now;
> +
> +	/*
> +	 * Pre-emption delay will  be granted only once. If this task
> +	 * has already been granted delay, reschedule now
> +	 */
> +	if (curr->sched_preempt_delay.delay_granted) {
> +		curr->sched_preempt_delay.delay_granted = 0;
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there
> +	 * is a page fault accessing this flag in userspace, that means
> +	 * userspace has not touched this flag recently and we can
> +	 * assume no preemption delay is needed.
> +	 *
> +	 * If task is not requesting additional timeslice, resched now
> +	 */
> +	if (delay_req) {
> +		int ret;
> +
> +		pagefault_disable();
> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> +				sizeof(u32));
> +		pagefault_enable();

This all looks rather hacky and unnecessary.  Can't we somehow use
plain old get_user() and avoid such fuss?

> +		delay_flag = &delay_req_flag;
> +		if (ret || !delay_flag[0])
> +			goto resched_now;
> +	} else {
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Current thread has requested preemption delay and has not
> +	 * been granted an extension yet. If this thread failed to yield
> +	 * processor after being granted amnesty last time, penalize it
> +	 * by not granting this delay request, otherwise give it an extra
> +	 * timeslice.
> +	 */
> +	if (curr->sched_preempt_delay.yield_penalty) {
> +		curr->sched_preempt_delay.yield_penalty = 0;
> +		goto resched_now;
> +	}
> +
> +	se = &curr->se;
> +	curr->sched_preempt_delay.delay_granted = 1;
> +
> +	/*
> +	 * Set the penalty flag for failing to yield the processor after
> +	 * being granted immunity. This flag will be cleared in
> +	 * sched_yield() if the thread indeed calls sched_yield
> +	 */
> +	curr->sched_preempt_delay.yield_penalty = 1;
> +
> +	/*
> +	 * Let the thread know it got amnesty and it should call
> +	 * sched_yield() when it is done to avoid penalty next time
> +	 * it wants amnesty. We need to write to userspace location.
> +	 * Since we just read from this location, chances are extremely
> +	 * low we might page fault. If we do page fault, we will ignore
> +	 * it and accept the cost of failed write in form of unnecessary
> +	 * penalty for userspace task for not yielding processor.
> +	 * This is a highly unlikely scenario.
> +	 */
> +	delay_flag[0] = 0;
> +	delay_flag[1] = 1;
> +	pagefault_disable();
> +	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
> +	pagefault_enable();

and put_user() here.

> +	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
> +	return;
> +
> +resched_now:
> +	resched_task(curr);
> +}
> +#else
> +#define delay_resched_task(curr) resched_task(curr)

This could have been implemented in C...

> +#endif /* CONFIG_SCHED_PREEMPT_DELAY */
> +
>
> ...
>

^ permalink raw reply	[flat|nested] 67+ messages in thread
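
One way to address the bitfield concern raised above, sketched only as an illustration (this is not a posted revision of the patch): give each flag a full byte so concurrent updates do not share a read-modify-write, at the cost of a couple of extra bytes per task.

	#ifdef CONFIG_SCHED_PREEMPT_DELAY
		struct preempt_delay {
			u32 __user *delay_req;		/* delay request flag pointer */
			unsigned char delay_granted;	/* currently in delay */
			unsigned char yield_penalty;	/* failure to yield penalty */
		} sched_preempt_delay;
	#endif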

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
  2014-03-25 17:44   ` Andrew Morton
@ 2014-03-25 17:46   ` Oleg Nesterov
  2014-03-25 17:59     ` Khalid Aziz
  2014-03-25 18:20   ` Andi Kleen
  2014-03-25 18:59   ` Eric W. Biederman
  3 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-25 17:46 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, gnomes,
	riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On 03/25, Khalid Aziz wrote:
>
>  fs/proc/base.c                                     |  89 ++++++++

This code can be simplified, but the real question is why do we need it.

Just add PR_SET_PREEMPT_DELAY ?

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread
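
For a sense of what Oleg's suggestion would mean for userspace, a hypothetical sketch: PR_SET_PREEMPT_DELAY does not exist as of this thread, so the constant and its semantics (register or clear the per-thread flag word) are assumptions, not an established API.

	#include <stdint.h>
	#include <sys/prctl.h>

	/* Hypothetical interface: byte 0 = delay request, byte 1 = delay granted. */
	static uint32_t preempt_delay_flags;

	static int preempt_delay_register(void)
	{
		/* assumes a new PR_SET_PREEMPT_DELAY constant were added to <linux/prctl.h> */
		return prctl(PR_SET_PREEMPT_DELAY, &preempt_delay_flags, 0, 0, 0);
	}

	static int preempt_delay_unregister(void)
	{
		return prctl(PR_SET_PREEMPT_DELAY, NULL, 0, 0, 0);
	}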

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:44   ` Andrew Morton
@ 2014-03-25 17:56     ` Khalid Aziz
  2014-03-25 18:14       ` Andrew Morton
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
	riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On 03/25/2014 11:44 AM, Andrew Morton wrote:
> So the procfs file is written in binary format and is read back in
> ascii format.  Seems odd.
>
> Perhaps this should all be done as a new syscall rather than some
> procfs thing.
>

I didn't want to add yet another syscall which will then need to be 
added to glibc, but I am open to doing it through a syscall if that is 
the consensus.

>> +	struct preempt_delay {
>> +		u32 __user *delay_req;		/* delay request flag pointer */
>> +		unsigned char delay_granted:1;	/* currently in delay */
>> +		unsigned char yield_penalty:1;	/* failure to yield penalty */
>> +	} sched_preempt_delay;
>
> The problem with bitfields is that a write to one bitfield can corrupt
> a concurrent write to the other one.  So it's your responsibility to
> provide locking and/or to describe how this race is avoided.  A comment
> here in the definition would be a suitable way of addressing this.
>

I do not have a strong reason to use a bitfield, just trying to not use 
any more bytes than I need to. If using a char is safer, I would rather 
use safer code.

>> +	if (delay_req) {
>> +		int ret;
>> +
>> +		pagefault_disable();
>> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
>> +				sizeof(u32));
>> +		pagefault_enable();
>
> This all looks rather hacky and unnecessary.  Can't we somehow use
> plain old get_user() and avoid such fuss?

get_user() takes longer and can sleep if a page fault occurs. I need this 
code to be very fast for it to be beneficial, and I am willing to ignore 
page faults since a page fault would imply the task has not touched the 
pre-emption delay request field and hence we can resched safely.

>> +#else
>> +#define delay_resched_task(curr) resched_task(curr)
>
> This could have been implemented in C...
>

Sure, I can do that.

Thanks, Andrew!

--
Khalid


* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:46   ` Oleg Nesterov
@ 2014-03-25 17:59     ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, gnomes,
	riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On 03/25/2014 11:46 AM, Oleg Nesterov wrote:
> On 03/25, Khalid Aziz wrote:
>>
>>   fs/proc/base.c                                     |  89 ++++++++
>
> This code can be simplified, but the real question is why do we need it.
>
> Just add PR_SET_PREEMPT_DELAY ?
>
> Oleg.
>

I like this idea! Thanks.

--
Khalid

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:56     ` Khalid Aziz
@ 2014-03-25 18:14       ` Andrew Morton
  0 siblings, 0 replies; 67+ messages in thread
From: Andrew Morton @ 2014-03-25 18:14 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
	riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On Tue, 25 Mar 2014 11:56:31 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> On 03/25/2014 11:44 AM, Andrew Morton wrote:
> > So the procfs file is written in binary format and is read back in
> > ascii format.  Seems odd.
> >
> > Perhaps this should all be done as a new syscall rather than some
> > procfs thing.
> >
> 
> I didn't want to add yet another syscall which will then need to be 
> added to glibc, but I am open to doing it through a syscall if that is 
> the consensus.
> 
> >> +	struct preempt_delay {
> >> +		u32 __user *delay_req;		/* delay request flag pointer */
> >> +		unsigned char delay_granted:1;	/* currently in delay */
> >> +		unsigned char yield_penalty:1;	/* failure to yield penalty */
> >> +	} sched_preempt_delay;
> >
> > The problem with bitfields is that a write to one bitfield can corrupt
> > a concurrent write to the other one.  So it's your responsibility to
> > provide locking and/or to describe how this race is avoided.  A comment
> > here in the definition would be a suitable way of addressing this.
> >
> 
> I do not have a strong reason to use a bitfield, just trying to not use 
> any more bytes than I need to. If using a char is safer, I would rather 
> use safer code.

My point is that the locking rules should be documented, via a code comment.

Presumably that rule is "only ever modified by this task".
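
Something along these lines, say (a sketch of the non-bitfield form,
assuming that rule does hold):

	struct preempt_delay {
		/*
		 * These fields are only ever written in the context of the
		 * owning task (the scheduler paths running on its behalf and
		 * its own sched_yield()), so no locking is needed.  Plain
		 * chars avoid the read-modify-write races of bitfields.
		 */
		u32 __user *delay_req;		/* delay request flag pointer */
		unsigned char delay_granted;	/* currently in delay */
		unsigned char yield_penalty;	/* failure to yield penalty */
	} sched_preempt_delay;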

> >> +	if (delay_req) {
> >> +		int ret;
> >> +
> >> +		pagefault_disable();
> >> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> >> +				sizeof(u32));
> >> +		pagefault_enable();
> >
> > This all looks rather hacky and unnecessary.  Can't we somehow use
> > plain old get_user() and avoid such fuss?
> 
> get_user() takes longer and can sleep if a page fault occurs. I need this 
> code to be very fast for it to be beneficial, and I am willing to ignore 
> page faults since a page fault would imply the task has not touched the 
> pre-emption delay request field and hence we can resched safely.

That's what I meant by "hacky" :)



* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
  2014-03-25 17:44   ` Andrew Morton
  2014-03-25 17:46   ` Oleg Nesterov
@ 2014-03-25 18:20   ` Andi Kleen
  2014-03-25 18:47     ` Khalid Aziz
  2014-03-25 18:59   ` Eric W. Biederman
  3 siblings, 1 reply; 67+ messages in thread
From: Andi Kleen @ 2014-03-25 18:20 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

Khalid Aziz <khalid.aziz@oracle.com> writes:

First it would be nice to have some standard reference lock library that
uses this. What would it take to support this in glibc?

> +==================================
> +Using the preemption delay feature
> +==================================
> +
> +This feature is enabled in the kernel by setting
> +CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
> +enabled, the userspace process communicates with the kernel using a
> +4-byte memory location in its address space. It first gives the kernel
> +address for this memory location by writing its address to
> +/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
> +interpreted as a sequence of 4 bytes:
> +
> +	byte[0] = flag to request preemption delay
> +	byte[1] = flag from kernel indicating preemption delay was granted
> +	byte[2] = reserved for future use
> +	byte[3] = reserved for future use

Should reserve more bytes (64, 128?) and rename the proc flag
to a more generic name. I could well imagine other things
using such a mechanism in the future. Also please add a flag
word with feature bits (similar to the perf mmap page).

How about alignment? x86 will not care, but other architectures
may. 

>  #endif
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	REG("sched_preempt_delay", S_IRUGO|S_IWUSR,
> proc_tid_preempt_delay_ops),

This shouldn't be readable by group/other, as it exposes the address space,
so could help exploits.
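
i.e. something like (sketch):

	REG("sched_preempt_delay", S_IRUSR|S_IWUSR, proc_tid_preempt_delay_ops),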

> @@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
>  static inline bool sched_can_stop_tick(void) { return false; }
>  #endif
>  
> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
> +extern void sched_preempt_delay_show(struct seq_file *m,
> +					struct task_struct *task);
> +extern void sched_preempt_delay_set(struct task_struct *task,
> +					unsigned char *val);
> +#endif

Prototypes don't need to be ifdefed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 18:20   ` Andi Kleen
@ 2014-03-25 18:47     ` Khalid Aziz
  2014-03-25 19:47       ` Andi Kleen
  0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 18:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
	linux-kernel, linux-doc

On 03/25/2014 12:20 PM, Andi Kleen wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
>
> First it would be nice to have some standard reference lock library that
> uses this. What would it take to support this in glibc?

I am not sure if it would be practical and useful to integrate this into 
any of the standard locking interfaces, but I have not looked into it 
much either. My initial intent is to let individual apps decide if they 
could benefit from this interface and code it in if so since the 
interface is meant to be very simple. Do you see any of the standard 
locking interfaces where it would make sense to integrate this feature 
in, or are you thinking of creating a new interface?
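
For a rough idea of how small the app-side code can be, here is a sketch
using the byte layout quoted below ("delay" is assumed to point at the
4-byte shared area the thread has already registered with the kernel,
and the pthread mutex just stands in for whatever lock the application
uses):

	volatile unsigned char *delay;	/* registered 4-byte shared area */

	pthread_mutex_lock(&lock);
	delay[0] = 1;			/* request pre-emption delay */
	/* ... short critical section ... */
	delay[0] = 0;			/* withdraw the request */
	pthread_mutex_unlock(&lock);
	if (delay[1]) {			/* kernel granted extra time */
		delay[1] = 0;
		sched_yield();		/* give the processor back */
	}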

>
>> +==================================
>> +Using the preemption delay feature
>> +==================================
>> +
>> +This feature is enabled in the kernel by setting
>> +CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
>> +enabled, the userspace process communicates with the kernel using a
>> +4-byte memory location in its address space. It first gives the kernel
>> +address for this memory location by writing its address to
>> +/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
>> +interpreted as a sequence of 4 bytes:
>> +
>> +	byte[0] = flag to request preemption delay
>> +	byte[1] = flag from kernel indicating preemption delay was granted
>> +	byte[2] = reserved for future use
>> +	byte[3] = reserved for future use
>
> Should reserve more bytes (64, 128?) and rename the proc flag
> to a more generic name. I could well imagine other things
> using such a mechanism in the future. Also please add a flag
> word with feature bits (similar to the perf mmap page).

I am reluctant to make it too big since reading larger quantities from 
userspace will take longer and start to impact performance. Keeping 
shared data limited to 32-bits allows us to move it between userspace 
and kernel with one instruction.

>
> How about alignment? x86 will not care, but other architectures
> may.
>

You are right. These 4 bytes need to be aligned to a 4-byte boundary. I 
will look into adding an alignment check at the time this address is 
passed to the kernel.
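
Something as simple as the following, in whichever function ends up
accepting the address, should do (sketch only):

	if ((unsigned long)delay_req & (sizeof(u32) - 1))
		return -EINVAL;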

>>   #endif
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> +	REG("sched_preempt_delay", S_IRUGO|S_IWUSR,
>> proc_tid_preempt_delay_ops),
>
> This shouldn't be readable by group/other, as it exposes the address space,
> so could help exploits.

I like Oleg's suggestion of using prctl() instead of procfs to pass the 
userspace address to kernel. Above problem will disappear when I switch 
to prctl() in v3 of this patch.
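
Registration would then shrink to something like this (hypothetical
sketch; PR_SET_PREEMPT_DELAY is only the name Oleg proposed and does
not exist yet):

	unsigned char delay[4] __attribute__((aligned(4)));

	/* hand the shared flag area to the kernel */
	prctl(PR_SET_PREEMPT_DELAY, (unsigned long)delay, 0, 0, 0);
	...
	/* unregister before the area goes away */
	prctl(PR_SET_PREEMPT_DELAY, 0, 0, 0, 0);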

>
>> @@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
>>   static inline bool sched_can_stop_tick(void) { return false; }
>>   #endif
>>
>> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
>> +extern void sched_preempt_delay_show(struct seq_file *m,
>> +					struct task_struct *task);
>> +extern void sched_preempt_delay_set(struct task_struct *task,
>> +					unsigned char *val);
>> +#endif
>
> Prototypes don't need to be ifdefed.
>
> -Andi
>

Ah, you are right. Thanks, Andi!

--
Khalid


* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
                     ` (2 preceding siblings ...)
  2014-03-25 18:20   ` Andi Kleen
@ 2014-03-25 18:59   ` Eric W. Biederman
  2014-03-25 19:15     ` Khalid Aziz
  2014-03-26  6:03     ` Mike Galbraith
  3 siblings, 2 replies; 67+ messages in thread
From: Eric W. Biederman @ 2014-03-25 18:59 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	linux-kernel, linux-doc

Khalid Aziz <khalid.aziz@oracle.com> writes:

> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock.  This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical section
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply deassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel.  Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler.  Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times a thread was granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay


Let me see if I understand the problem.  Your simulated application has
a ridiculous number of threads (1000) all contending for a single lock
with fairly long lock hold times between 600 and 20000 clocks assuming
no cache line misses.  So 1000 threads contending for about 10usec, or
1/100 of a tick when HZ=1000, giving you something like 1 chance in
100 of being preempted while holding the lock.  With 1000 threads
those sound like pretty bad odds.

Either your test program is a serious exaggeration of what your userspace
is doing or this looks like an application design problem.

I am sorry no number of kernel patches can fix a stupid userspace
application, and what is worse it looks like this approach will make
the situation worse for applications that aren't stupid.  Because they
will now suffer from much less predictability in how long they have to
wait for the cpu.

Maybe if this was limited to a cooperating set of userspace
tasks/threads this might not be too bad.  As this exists I have users
who would hunt me down with malicious intent if this code ever showed up
on our servers, because it would make life for every other application
on the server worse.

The only two sane versions of this I can see are (a) having the
scheduler write the predicted next preemption time into the vdso page so
your thread can yield preemptively before taking the lock if it doesn't
look like it has enough time, or (b) limiting this to just a small
cooperating set of threads in a single cgroup.

As you have not limited the effects of this patch and as this will make
latencies worse for every other program on a system I think this is a
horrible approach.  This really is not something you can do unless all
of the threads that could be affected are in the same code base, which
is definitely not the case here.

So for the general horrible idea.
Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>

Cooperative multitasking sucked in Windows 3.1 and it would be much
worse now.  Please stop the crazy.  Linux is challenging enough to
comprehend as it is, and I can't possibly see this patch makes anything
more predictable.

Eric

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 18:59   ` Eric W. Biederman
@ 2014-03-25 19:15     ` Khalid Aziz
  2014-03-25 20:31       ` Eric W. Biederman
  2014-03-26  6:03     ` Mike Galbraith
  1 sibling, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 19:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	linux-kernel, linux-doc

On 03/25/2014 12:59 PM, ebiederm@xmission.com wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
>
>> This patch adds a way for a thread to request additional timeslice from
>> the scheduler if it is about to be preempted, so it could complete any
>> critical task it is in the middle of. .......
>
>
> Let me see if I understand the problem.  Your simulated application has
> a ridiculous number of threads (1000) all contending for a single lock
> with fairly long lock hold times between 600 and 20000 clocks assuming
> no cache line misses.  So 1000 threads contending for about 10usec, or
> 1/100 of a tick when HZ=1000, giving you something like 1 chance in
> 100 of being preempted while holding the lock.  With 1000 threads
> those sound like pretty bad odds.

This problem does not happen because threads are holding the lock for 
too long; rather, it happens when a thread does a large number of things 
in its loop and one small part of it requires it to hold a lock. So it 
holds the lock for a very short time, but what can happen is that the 
thread is executing in the non-critical section of its loop, it finally 
gets to the critical section just as its timeslice is about to end, and 
it grabs the lock and is pre-empted right away. Now we start building a 
convoy of threads that want the same lock. This problem can be avoided 
if the locking thread could be given additional time to complete its 
critical section, release the lock and yield the processor if it indeed 
was granted amnesty by the scheduler.

>
> Either your test program is a serious exaggeration of what your userspace
> is doing or this looks like an application design problem.

If you are referring to the test I included in the patch, that test is 
designed to exaggerate this problem just so it could test the 
functionality. It is nothing more than a test for this functionality so 
we can test for any regressions in future.

>
> I am sorry no number of kernel patches can fix a stupid userspace
> application, and what is worse it looks like this approach will make
> the situation worse for applications that aren't stupid.  Because they
> will now suffer from much less predictability in how long they have to
> wait for the cpu.

The new code in the scheduler kicks in only for the threads that have 
explicitly asked the kernel to use this functionality by sending it the 
address of a shared memory location. There is a very quick bail-out at 
the top of the new resched routine to check for this, so it does not 
affect every other thread.

>
> Maybe if this was limited to a cooperating set of userspace
> tasks/threads this might not be too bad.  As this exists I have users
> who would hunt me down with malicious intent if this code ever showed up
> on our servers, because it would make life for every other application
> on the server worse.
>

Yes, it is indeed limited to a cooperating set of userspace 
tasks/threads. Tasks/threads will explicitly choose to use this feature. 
It is a no-op for every one else.

> The only two sane versions of this I can see are (a) having the
> scheduler write the predicted next preemption time into the vdso page so
> your thread can yield preemptively before taking the lock if it doesn't
> look like it has enough time,

This was discussed as an option when I posted the first version of this 
patch, and it was dismissed due to the lack of reliability of such data.

> or (b) limiting this to just a small
> cooperating set of threads in a single cgroup.

and that is almost what this patch does. It is not limited to a cgroup, 
rather to the tasks/threads that ask to use this feature.

>
> As you have not limited the effects of this patch and as this will make
> latencies worse for every other program on a system I think this is a
> horrible approach.  This really is not something you can do unless all
> of the threads that could be affected are in the same code base, which
> is definitely not the case here.
>
> So for the general horrible idea.
> Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Cooperative multitasking sucked in Windows 3.1 and it would be much
> worse now.  Please stop the crazy.  Linux is challenging enough to
> comprehend as it is, and I can't possibly see this patch makes anything
> more predictable.
>
> Eric
>

I completely agree with you on cooperative multitasking. I am definitely 
not trying to do anything like that.

Thanks,
Khalid


* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 18:47     ` Khalid Aziz
@ 2014-03-25 19:47       ` Andi Kleen
  0 siblings, 0 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-25 19:47 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andi Kleen, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
	viro, oleg, gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	ebiederm, linux-kernel, linux-doc

On Tue, Mar 25, 2014 at 12:47:52PM -0600, Khalid Aziz wrote:
> I am not sure if it would be practical and useful to integrate this
> into any of the standard locking interfaces, but I have not looked
> into it much either. My initial intent is to let individual apps
> decide if they could benefit from this interface and code it in if
> so since the interface is meant to be very simple. Do you see any of
> the standard locking interfaces where it would make sense to
> integrate this feature in, or are you thinking of creating a new
> interface?

It would probably make sense to use by default with glibc adaptive mutexes.

> I am reluctant to make it too big since reading larger quantities
> from userspace will take longer and start to impact performance.
> Keeping shared data limited to 32-bits allows us to move it between
> userspace and kernel with one instruction.

You don't need to read/write more. Just reserve more so that it 
can be sensibly extended later.

The feature bits would only need to be written once when it is set up.
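
E.g. (purely illustrative, one possible way to lay out a 64-byte
reservation, not anything that exists today):

	struct preempt_delay_area {
		u32 features;		/* feature bits, written once at setup */
		u8  delay_req;		/* request preemption delay */
		u8  delay_granted;	/* kernel: delay was granted */
		u8  reserved[58];	/* room for future extensions */
	};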


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 19:15     ` Khalid Aziz
@ 2014-03-25 20:31       ` Eric W. Biederman
  2014-03-25 21:37         ` Khalid Aziz
  0 siblings, 1 reply; 67+ messages in thread
From: Eric W. Biederman @ 2014-03-25 20:31 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	linux-kernel, linux-doc

Khalid Aziz <khalid.aziz@oracle.com> writes:

> On 03/25/2014 12:59 PM, ebiederm@xmission.com wrote:
>> Khalid Aziz <khalid.aziz@oracle.com> writes:
>>
>>> This patch adds a way for a thread to request additional timeslice from
>>> the scheduler if it is about to be preempted, so it could complete any
>>> critical task it is in the middle of. .......
>>
>>
>> Let me see if I understand the problem.  Your simulated application has
>> a ridiculous number of threads (1000) all contending for a single lock
>> with fairly long lock hold times between 600 and 20000 clocks assuming
>> no cache line misses.  So 1000 threads contending for about 10usec, or
>> 1/100 of a tick when HZ=1000, giving you something like 1 chance in
>> 100 of being preempted while holding the lock.  With 1000 threads
>> those sound like pretty bad odds.
>
> This problem does not happen because threads are holding the lock for too long;
> rather, it happens when a thread does a large number of things in its loop and
> one small part of it requires it to hold a lock. So it holds the lock for a very
> short time, but what can happen is that the thread is executing in the
> non-critical section of its loop, it finally gets to the critical section just
> as its timeslice is about to end, and it grabs the lock and is pre-empted right
> away. Now we start building a convoy of threads that want the same lock. This
> problem can be avoided if the locking thread could be given additional time to
> complete its critical section, release the lock and yield the processor if it
> indeed was granted amnesty by the scheduler.

I would dearly like to see the math that shows such a change will
actually significantly change the probabilities of this hitting.  It
seems more likely that this will just be a way for threads to cheat and
get larger time slices.

>> Maybe if this was limited to a cooperating set of userspace
>> tasks/threads this might not be too bad.  As this exists I have users
>> who would hunt me down with malicious intent if this code ever showed up
>> on our servers, because it would make life for every other application
>> on the server worse.
>>
>
> Yes, it is indeed limited to a cooperating set of userspace
> tasks/threads. Tasks/threads will explicitly choose to use this feature. It is a
> no-op for every one else.

It is absolutely not a no-op for me if my task can't be scheduled soon
enough because your task executed a sched-preempt delay.

It means my latency goes up and the *random bad thing* will happen
because I missed my deadline because I was not scheduled fast enough.

>> or (b) limiting this to just a small
>> cooperating set of threads in a single cgroup.
>
> and that is almost what this patch does. It is not limited to a cgroup, rather
> to the tasks/threads that ask to use this feature.

Except you do not appear to be considering what could be scheduled in
your task's place.

You allow any task to extend its timeslice.

Which means I will get the question: why does
really_important_job only miss its latency guarantees when running on
the same box as sched_preempt_using_job?

Your change appears to have extremely difficult to debug non-local
effects.

Eric


* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 20:31       ` Eric W. Biederman
@ 2014-03-25 21:37         ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 21:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	linux-kernel, linux-doc

On 03/25/2014 02:31 PM, ebiederm@xmission.com wrote:
>> Yes, it is indeed limited to a cooperating set of userspace
>> tasks/threads. Tasks/threads will explicitly choose to use this feature. It is a
>> no-op for every one else.
>
> It is absolutely not a no-op for me if my task can't be scheduled soon
> enough because your task executed a sched-preempt delay.
>
> It means my latency goes up and the *random bad thing* will happen
> because I missed my deadline because I was not scheduled fast enough.
>

This patch changes only CFS, so SCHED_RR and SCHED_FIFO are not affected 
by this change. Does CFS ever guarantee that any task will be scheduled 
fast enough to meet its critical deadline? Shouldn't a system that 
requires guarantees to meet a deadline use SCHED_FIFO or SCHED_RR? Even 
a task that has an important deadline to meet and is ready to run can 
be denied processor time simply based upon which other tasks are ready 
to run, how long they have run previously and what their priorities 
are. I don't see how CFS can guarantee temporal determinism other than 
that every task will get a chance to run in a finite time period. 
Please do correct me if my understanding of CFS is flawed.

Any task that asks for extension of its current timeslice will get 
charged this additional time in p->se.vruntime and cause it to move to 
the right on rbtree. This amounts to a task borrowing time from its 
future runtime as opposed to being able to get more processor time than 
others.
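
The charging itself is just CFS's normal accounting; roughly this
fragment of update_curr() in kernel/sched/fair.c does it, weighting
every nanosecond actually run (including any time run during a granted
delay) and pushing the task to the right in the rbtree:

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);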

>>> or (b) limiting this to just a small
>>> cooperating set of threads in a single cgroup.
>>
>> and that is almost what this patch does. It is not limited to a cgroup, rather
>> to the tasks/threads that ask to use this feature.
>
> Except you do not appear to be considering what could be scheduled in
> your task's place.
>
> You allow any task to extend its timeslice.

Would you recommend a policy that determines which tasks are allowed to 
extend their timeslice?

>
> Which means I will get the question: why does
> really_important_job only miss its latency guarantees when running on
> the same box as sched_preempt_using_job?

I would expect a system that requires latency guarantees to be designed 
using SCHED_FIFO or SCHED_RR, not SCHED_OTHER.

Hope this makes sense.

Thanks,
Khalid



* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
                   ` (3 preceding siblings ...)
  2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
@ 2014-03-25 23:01 ` Davidlohr Bueso
  2014-03-25 23:29   ` Khalid Aziz
  4 siblings, 1 reply; 67+ messages in thread
From: Davidlohr Bueso @ 2014-03-25 23:01 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	venki, linux-kernel

On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.
> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.
> 
> I have implemented similar functionality for Linux. This patch adds a
> file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
> Writing 1 to this file causes CFS scheduler to grant additional time
> slice if the currently running process comes up for pre-emption. Writing
> to this file needs to be very quick operation, so I have implemented
> code to allow mmap'ing /proc/<tgid>/task/<tid>/sched_preempt_delay. This
> allows a userspace task to write this flag very quickly. Usage model is
> a thread mmaps this file during initialization. It then writes a 1 to
> the mmap'd file after it grabs the lock in its critical section where it
> wants immunity from pre-emption. It then writes 0 again to this file
> after it releases the lock and calls sched_yield() to give up the
> processor. I have also added a new field in scheduler statistics -
> nr_preempt_delayed, that counts the number of times a thread has been
> granted amnesty. Further details on using this functionality are in 
> Documentation/scheduler/sched-preempt-delay.txt in the patch. This
> patch is based upon the work Venkatesh Pallipadi had done couple of
> years ago.
> 
> Please provide feedback on this functionality and patch.

Good timing! The topic came up just yesterday in LSF/MM. This
functionality is on the wish list for both facebook and postgres.


* Re: [RFC] [PATCH] Pre-emption control for userspace
  2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
@ 2014-03-25 23:29   ` Khalid Aziz
  0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 23:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
	venki, linux-kernel

On 03/25/2014 05:01 PM, Davidlohr Bueso wrote:
> Good timing! The topic came up just yesterday in LSF/MM. This
> functionality is on the wish list for both facebook and postgres.
>

Thanks for letting me know. I am glad to hear of others who need this 
functionality. Did you happen to catch the names of folks at facebook 
and postgres that are interested in this? I would like to work with them 
to make sure what I do works for them as well.

--
Khalid

* Re: [PATCH v2] Pre-emption control for userspace
  2014-03-25 18:59   ` Eric W. Biederman
  2014-03-25 19:15     ` Khalid Aziz
@ 2014-03-26  6:03     ` Mike Galbraith
  1 sibling, 0 replies; 67+ messages in thread
From: Mike Galbraith @ 2014-03-26  6:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Khalid Aziz, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
	viro, oleg, gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
	linux-kernel, linux-doc

On Tue, 2014-03-25 at 11:59 -0700, Eric W. Biederman wrote:

> So for the general horrible idea.
> Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>

Goody.  I was surprised Peter didn't make it instantly dead.

-Mike



Thread overview: 67+ messages

2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29   ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44   ` Khalid Aziz
2014-03-04 18:38     ` Al Viro
2014-03-04 19:01       ` Khalid Aziz
2014-03-04 19:03     ` Oleg Nesterov
2014-03-04 20:14       ` Khalid Aziz
2014-03-05 14:38         ` Oleg Nesterov
2014-03-05 16:12           ` Oleg Nesterov
2014-03-05 17:10             ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39   ` Khalid Aziz
2014-03-04 22:23     ` One Thousand Gnomes
2014-03-04 22:44       ` Khalid Aziz
2014-03-05  0:39         ` Thomas Gleixner
2014-03-05  0:51           ` Andi Kleen
2014-03-05 11:10             ` Peter Zijlstra
2014-03-05 17:29               ` Khalid Aziz
2014-03-05 19:58               ` Khalid Aziz
2014-03-06  9:57                 ` Peter Zijlstra
2014-03-06 16:08                   ` Khalid Aziz
2014-03-06 11:14                 ` Thomas Gleixner
2014-03-06 16:32                   ` Khalid Aziz
2014-03-05 14:54             ` Oleg Nesterov
2014-03-05 15:56               ` Andi Kleen
2014-03-05 16:36                 ` Oleg Nesterov
2014-03-05 17:22                   ` Khalid Aziz
2014-03-05 23:13                     ` David Lang
2014-03-05 23:48                       ` Khalid Aziz
2014-03-05 23:56                         ` H. Peter Anvin
2014-03-06  0:02                           ` Khalid Aziz
2014-03-06  0:13                             ` H. Peter Anvin
2014-03-05 23:59                         ` David Lang
2014-03-06  0:17                           ` Khalid Aziz
2014-03-06  0:36                             ` David Lang
2014-03-06  1:22                               ` Khalid Aziz
2014-03-06 14:23                                 ` David Lang
2014-03-06 12:13             ` Kevin Easton
2014-03-06 13:59               ` Peter Zijlstra
2014-03-06 22:41                 ` Andi Kleen
2014-03-06 14:25               ` David Lang
2014-03-06 16:12                 ` Khalid Aziz
2014-03-06 13:24   ` Rasmus Villemoes
2014-03-06 13:34     ` Peter Zijlstra
2014-03-06 13:45       ` Rasmus Villemoes
2014-03-06 14:02         ` Peter Zijlstra
2014-03-06 14:33           ` Thomas Gleixner
2014-03-06 14:34             ` H. Peter Anvin
2014-03-06 14:04         ` Thomas Gleixner
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44   ` Andrew Morton
2014-03-25 17:56     ` Khalid Aziz
2014-03-25 18:14       ` Andrew Morton
2014-03-25 17:46   ` Oleg Nesterov
2014-03-25 17:59     ` Khalid Aziz
2014-03-25 18:20   ` Andi Kleen
2014-03-25 18:47     ` Khalid Aziz
2014-03-25 19:47       ` Andi Kleen
2014-03-25 18:59   ` Eric W. Biederman
2014-03-25 19:15     ` Khalid Aziz
2014-03-25 20:31       ` Eric W. Biederman
2014-03-25 21:37         ` Khalid Aziz
2014-03-26  6:03     ` Mike Galbraith
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
2014-03-25 23:29   ` Khalid Aziz
