* [RFC] [PATCH] Pre-emption control for userspace
@ 2014-03-03 18:07 Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
` (4 more replies)
0 siblings, 5 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-03 18:07 UTC (permalink / raw)
To: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg, venki
Cc: Khalid Aziz, linux-kernel
I am working on a feature that has been requested by database folks that
helps with performance. Some of the oft executed database code uses
mutexes to lock other threads out of a critical section. They often see
a situation where a thread grabs the mutex, runs out of its timeslice
and gets switched out which then causes another thread to run which
tries to grab the same mutex, spins for a while and finally gives up.
This can happen with multiple threads until original lock owner gets the
CPU again and can complete executing its critical section. This queueing
and subsequent CPU cycle wastage can be avoided if the locking thread
could request to be granted an additional timeslice if its current
timeslice runs out before it gives up the lock. Other operating systems
have implemented this functionality and is used by databases as well as
JVM. This functionality has been shown to improve performance by 3%-5%.
I have implemented similar functionality for Linux. This patch adds a
file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
Writing 1 to this file causes CFS scheduler to grant additional time
slice if the currently running process comes up for pre-emption. Writing
to this file needs to be very quick operation, so I have implemented
code to allow mmap'ing /proc/<tgid>/task/<tid>/sched_preempt_delay. This
allows a userspace task to write this flag very quickly. Usage model is
a thread mmaps this file during initialization. It then writes a 1 to
the mmap'd file after it grabs the lock in its critical section where it
wants immunity from pre-emption. It then writes 0 again to this file
after it releases the lock and calls sched_yield() to give up the
processor. I have also added a new field in scheduler statistics -
nr_preempt_delayed, that counts the number of times a thread has been
granted amnesty. Further details on using this functionality are in
Documentation/scheduler/sched-preempt-delay.txt in the patch. This
patch is based upon the work Venkatesh Pallipadi had done couple of
years ago.
Please provide feedback on this functionality and patch.
-- Khalid
--------------------------------
This patch adds a way for a thread to request additional timeslice from
the scheduler if it is about to be preempted, so it could complete any
critical task it is in the middle of. This functionality helps with
performance on databases and has been used for many years on other OSs
by the databases. This functionality helps in situation where a thread
acquires a lock before performing a critical operation on the database,
happens to get preempted before it completes its task and releases the
lock. This lock causes all other threads that also acquire the same
lock to perform their critical operation on the database to start
queueing up and causing large number of context switches. This queueing
problem can be avoided if the thread that acquires lock first could
request scheduler to grant it an additional timeslice once it enters
its critical section and hence allow it to complete its critical sectiona
without causing queueing problem. If critical section completes before
the thread is due for preemption, the thread can simply desassert its
request. A thread sends the scheduler this request through a proc file
which it can mmap and write to with very little overhead. Documentation
file included in this patch contains further details on how to use this
functionality and conditions associated with its use. This patch also
adds a new field in scheduler statistics which keeps track of how many
times was a thread granted amnesty from preemption.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
Documentation/scheduler/sched-preempt-delay.txt | 85 ++++++++++++
arch/x86/Kconfig | 10 ++
fs/proc/base.c | 173 ++++++++++++++++++++++++
include/linux/preempt_delay.h | 37 +++++
include/linux/sched.h | 12 ++
kernel/exit.c | 9 ++
kernel/fork.c | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 59 +++++++-
kernel/sched/preempt_delay.c | 39 ++++++
11 files changed, 430 insertions(+), 3 deletions(-)
create mode 100644 Documentation/scheduler/sched-preempt-delay.txt
create mode 100644 include/linux/preempt_delay.h
create mode 100644 kernel/sched/preempt_delay.c
diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..aa7e611
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,85 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, scheduler ends up scheduling other
+threads which immediately try to grab the lock to enter the critical
+section. This only results in lots of context changes are tasks wake up
+and go to sleep immediately again. If on the other hand, the original
+task were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is enabled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
+enabled, kernel creates a file
+/proc/<tgid>/task/<tid>/sched_preempt_delay. This file can be mmapped by
+the task. This file contains a single long value which is a flag to
+indicate if the task is requesting a preemption delay or not. Task
+requests a preemption delay by writing a non-zero value to this file.
+Scheduler checks this value before preempting the task. Scheduler can
+choose to grant one and only an additional time slice to the task for
+each delay request. Scheduler will clear this flag when it makes
+decision to grant or not grant the delay request. Following sample code
+illustrates the use:
+
+int main()
+{
+ int fd, fsz;
+ unsigned char fn[256];
+ unsigned long *map;
+
+ sprintf(fn, “/proc/%lu/task/%lu/sched_preempt_delay”, getpid(), syscall(SYS_gettid));
+ fd = open(fn, O_RDWR);
+ fsz = sysconf(_SC_PAGE_SIZE);
+ map = mmap(NULL, fsz, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ while (/* some condition is true */) {
+ /* do some work and get ready to enter critical section */
+ map[0] = 1;
+ /*
+ * critical section
+ */
+ map[0] = 0;
+ /* Give the CPU up */
+ sched_yield();
+ /* do some more work */
+ }
+ munmap(map, fsz);
+ close(fd);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+Preemption delay features adds a new field to scheduler statictics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. /proc/<tgid>/task/<tid>/sched_preempt_delay can be mmap'd by only
+ one task at a time.
+
+2. Once mmap'd, /proc/<tgid>/task/<tid>/sched_preempt_delay can be written
+ to only by the task that mmap'd it. Reading
+ /proc/<tgid>/task/<tid>/sched_preempt_delay will continue to return the
+ current value.
+
+3. Upon mmap, /proc/<tgid>/task/<tid>/sched_preempt_delay is set to 0.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..ee10019 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -849,6 +849,16 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
+config SCHED_PREEMPT_DELAY
+ def_bool n
+ prompt "Scheduler preemption delay support"
+ depends on PROC_FS && PREEMPT_NOTIFIERS
+ ---help---
+ Say Y here if you want to be able to delay scheduler preemption
+ when possible by writing to
+ /proc/<tgid>/task/<tid>/sched_preempt_delay.
+ If in doubt, say "N".
+
source "kernel/Kconfig.preempt"
config X86_UP_APIC
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5150706..902d07f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1304,6 +1304,176 @@ static const struct file_operations proc_pid_sched_operations = {
#endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static int
+tid_preempt_delay_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ENOENT;
+
+ sched_preempt_delay_show(m, task);
+ put_task_struct(task);
+ return 0;
+}
+
+static ssize_t
+tid_preempt_delay_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ENOENT;
+
+ /*
+ * Do not allow write if proc file is currently mmap'd
+ */
+ if (task->sched_preempt_delay.mmap_state)
+ return -EPERM;
+
+ sched_preempt_delay_set(task, buf[0]);
+ put_task_struct(task);
+ return count;
+}
+
+static int
+tid_preempt_delay_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, tid_preempt_delay_show, inode);
+}
+
+static int
+fault_preempt_delay_vmops(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct preemp_delay_mmap_state *state;
+
+ state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
+ if (!state)
+ return VM_FAULT_SIGBUS;
+
+ if (vmf->flags & FAULT_FLAG_MKWRITE) {
+ SetPageUptodate(state->page);
+ return 0;
+ }
+
+ get_page(state->page);
+ vmf->page = state->page;
+ vmf->page->mapping = vma->vm_file->f_mapping;
+ vmf->page->index = vmf->pgoff;
+
+ return 0;
+}
+
+static void
+close_preempt_delay_vmops(struct vm_area_struct *vma)
+{
+ struct preemp_delay_mmap_state *state;
+
+ state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
+ BUG_ON(!state || !state->task);
+
+ state->page->mapping = NULL;
+ /* point delay request flag pointer back to old flag in task_struct */
+ state->task->sched_preempt_delay.delay_req =
+ &state->task->sched_preempt_delay.delay_flag;
+ state->task->sched_preempt_delay.mmap_state = NULL;
+ vfree(state->kaddr);
+ kfree(state);
+ vma->vm_private_data = NULL;
+}
+
+static const struct vm_operations_struct preempt_delay_vmops = {
+ .fault = fault_preempt_delay_vmops,
+ .page_mkwrite = fault_preempt_delay_vmops,
+ .close = close_preempt_delay_vmops,
+};
+
+
+static int
+tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ int retval = 0;
+ void *kaddr = NULL;
+ struct preemp_delay_mmap_state *state = NULL;
+ struct inode *inode = file_inode(file);
+ struct task_struct *task;
+ struct page *page;
+
+ /*
+ * Validate args:
+ * - Only offset 0 support for now
+ * - size should be PAGE_SIZE
+ */
+ if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
+ retval = -EINVAL;
+ goto error;
+ }
+
+ /*
+ * Only one mmap allowed at a time
+ */
+ if (current->sched_preempt_delay.mmap_state != NULL) {
+ retval = -EEXIST;
+ goto error;
+ }
+
+ state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
+ kaddr = vmalloc_user(PAGE_SIZE);
+ if (!state || !kaddr) {
+ retval = -ENOMEM;
+ goto error;
+ }
+
+ page = vmalloc_to_page(kaddr);
+ if (!page) {
+ retval = -ENOMEM;
+ goto error;
+ }
+ /*
+ * This mmap belongs to the thread that owns the preemption
+ * delay request flag, not to other threads that may belong to
+ * the same process.
+ */
+ task = get_proc_task(inode);
+ state->page = page;
+ state->kaddr = kaddr;
+ state->uaddr = (void *)vma->vm_start;
+ state->task = task;
+
+ /* Clear the current delay request flag */
+ task->sched_preempt_delay.delay_flag = 0;
+
+ /* Point delay request flag pointer to the newly allocated memory */
+ task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
+
+ task->sched_preempt_delay.mmap_state = state;
+ vma->vm_private_data = state;
+ vma->vm_ops = &preempt_delay_vmops;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
+
+ return 0;
+
+error:
+ kfree(state);
+ if (kaddr)
+ vfree(kaddr);
+ return retval;
+}
+
+static const struct file_operations proc_tid_preempt_delay_ops = {
+ .open = tid_preempt_delay_open,
+ .read = seq_read,
+ .write = tid_preempt_delay_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+ .mmap = tid_preempt_delay_mmap,
+};
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
/*
* Print out autogroup related information:
@@ -2998,6 +3168,9 @@ static const struct pid_entry tid_base_stuff[] = {
REG("gid_map", S_IRUGO|S_IWUSR, proc_gid_map_operations),
REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
#endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+ REG("sched_preempt_delay", S_IRUGO|S_IWUSR, proc_tid_preempt_delay_ops),
+#endif
};
static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/preempt_delay.h b/include/linux/preempt_delay.h
new file mode 100644
index 0000000..70caa81
--- /dev/null
+++ b/include/linux/preempt_delay.h
@@ -0,0 +1,37 @@
+#ifndef __PREEMPT_DELAY_H__
+#define __PREEMPT_DELAY_H__
+
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * struct preempt_delay is part of task struct. It keeps the status
+ * of current request flag, if the request has been granted and a
+ * pointer to strtcure that holds data needed for mmap. This structure
+ * has a variable to hold the delay request flag and a pointer to
+ * the delay request flag. When a new task is initialized, delay_req
+ * pointer points to delay_flag elemnt in this structure. This pointer
+ * is changed when mmap happens on /proc/<pid>/task/<tid>/sched_preempt_delay.
+ * Since procfs works on the granularity of a page size, it makes no sense
+ * to allocate a whole page for each task to hold just a flag. So
+ * initially delay_req pointer points to a single unsigned char sized
+ * variable. When mmap happens, a page is allocated and this pointer
+ * is redirected to the newly allocated page for mmap. Race is not much
+ * of an issue since delay_req always points to a valid memory address.
+ * mmap operation always causes task to start with a request flag value
+ * of 0, so old value of the flag is irrelevant. munmap will point
+ * delay_req back to delay_flag.
+ */
+struct preempt_delay {
+ unsigned char *delay_req; /* delay request flag pointer */
+ unsigned char delay_flag; /* delay request flag */
+ unsigned char delay_granted; /* currently in delay */
+ struct preemp_delay_mmap_state *mmap_state;
+};
+
+struct preemp_delay_mmap_state {
+ void *kaddr; /* kernel vmalloc access addr */
+ void *uaddr; /* user mapped address */
+ struct page *page;
+ struct task_struct *task;
+};
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+#endif /* __PREEMPT_DELAY_H__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..eeaf7f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -54,6 +54,7 @@ struct sched_param {
#include <linux/llist.h>
#include <linux/uidgid.h>
#include <linux/gfp.h>
+#include <linux/preempt_delay.h>
#include <asm/processor.h>
@@ -1056,6 +1057,7 @@ struct sched_statistics {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+ u64 nr_preempt_delayed;
};
#endif
@@ -1250,6 +1252,9 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+ struct preempt_delay sched_preempt_delay;
+#endif
pid_t pid;
pid_t tgid;
@@ -2061,6 +2066,13 @@ extern u64 scheduler_tick_max_deferment(void);
static inline bool sched_can_stop_tick(void) { return false; }
#endif
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+ struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+ unsigned char val);
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/kernel/exit.c b/kernel/exit.c
index 1e77fc6..b7e4dbf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -53,6 +53,7 @@
#include <linux/oom.h>
#include <linux/writeback.h>
#include <linux/shm.h>
+#include <linux/preempt_delay.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -721,6 +722,14 @@ void do_exit(long code)
validate_creds_for_do_exit(tsk);
+#if CONFIG_SCHED_PREEMPT_DELAY
+ if (tsk->sched_preempt_delay.mmap_state) {
+ sys_munmap((unsigned long)
+ tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
+ vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
+ kfree(tsk->sched_preempt_delay.mmap_state);
+ }
+#endif
/*
* We're taking recursive faults here in do_exit. Safest is to just
* leave this task alone and wait for reboot.
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..94b65c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -71,6 +71,7 @@
#include <linux/signalfd.h>
#include <linux/uprobes.h>
#include <linux/aio.h>
+#include <linux/preempt_delay.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1617,6 +1618,12 @@ long do_fork(unsigned long clone_flags,
init_completion(&vfork);
get_task_struct(p);
}
+#if CONFIG_SCHED_PREEMPT_DELAY
+ p->sched_preempt_delay.delay_req =
+ &p->sched_preempt_delay.delay_flag;
+ p->sched_preempt_delay.delay_flag = 0;
+ p->sched_preempt_delay.delay_granted = 0;
+#endif
wake_up_new_task(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 9a95c8c..b2582fe 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_PREEMPT_DELAY) += preempt_delay.o
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..2abd02b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -602,6 +602,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.statistics.nr_wakeups_affine_attempts);
P(se.statistics.nr_wakeups_passive);
P(se.statistics.nr_wakeups_idle);
+ P(se.statistics.nr_preempt_delayed);
{
u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7815709..4250733 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,7 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
+#include <linux/preempt_delay.h>
#include <trace/events/sched.h>
@@ -444,6 +445,58 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_task(): Check if the task about to be preempted has
+ * requested an additional time slice. If it has, grant it additional
+ * timeslice once.
+ */
+static void
+delay_resched_task(struct task_struct *curr)
+{
+ struct sched_entity *se;
+ int cpu = task_cpu(curr);
+ unsigned char *delay_req;
+
+ if (cpu != smp_processor_id())
+ goto resched_now;
+
+ /*
+ * Pre-emption delay will be granted only once. If this task
+ * has already been granted delay, rechedule now
+ */
+ delay_req = curr->sched_preempt_delay.delay_req;
+ if (curr->sched_preempt_delay.delay_granted) {
+ curr->sched_preempt_delay.delay_granted = 0;
+ if (delay_req)
+ *delay_req = 0;
+ goto resched_now;
+ }
+
+ /* If task is not requestin additional timeslice, resched now */
+ if (delay_req && (*delay_req == 0))
+ goto resched_now;
+
+ /*
+ * Current thread has requested preemption delay and has not
+ * been granted an extension yet. Give it an extra timeslice.
+ */
+ se = &curr->se;
+ curr->sched_preempt_delay.delay_granted = 1;
+ schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+ return;
+
+resched_now:
+ resched_task(curr);
+}
+#else
+static void
+delay_resched_task(struct task_struct *curr)
+{
+ resched_task(curr);
+}
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
@@ -2679,7 +2732,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
- resched_task(rq_of(cfs_rq)->curr);
+ delay_resched_task(rq_of(cfs_rq)->curr);
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
@@ -2703,7 +2756,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;
if (delta > ideal_runtime)
- resched_task(rq_of(cfs_rq)->curr);
+ delay_resched_task(rq_of(cfs_rq)->curr);
}
static void
@@ -4477,7 +4530,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return;
preempt:
- resched_task(curr);
+ delay_resched_task(curr);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sched/preempt_delay.c b/kernel/sched/preempt_delay.c
new file mode 100644
index 0000000..a8b1c1e
--- /dev/null
+++ b/kernel/sched/preempt_delay.c
@@ -0,0 +1,39 @@
+/*
+ * preempt_delay.c - Facility to allow tasks to request additional
+ * timeslice from the scheduler at scheduler's discretion
+ * Copyright (C) 2013 Khalid Aziz <khalid.aziz@oracle.com>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ *
+ * CONFIG_SCHED_PREEMPT_DELAY infrastructure creates a proc file
+ * /proc/<pid>/task/<tid>/sched_preempt_delay which allows a task to
+ * signal to the scheduler to grant it an extra timeslice once if the
+ * task is about to be pre-empted, by writing a 1 to the file. This
+ * file includes the code to support reading from and writing to this
+ * proc file.
+ */
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+#include "sched.h"
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/preempt_delay.h>
+
+void
+sched_preempt_delay_show(struct seq_file *m, struct task_struct *task)
+{
+ unsigned char *delay_req = task->sched_preempt_delay.delay_req;
+
+ if (delay_req)
+ seq_printf(m, "%d\n", *delay_req);
+}
+
+void
+sched_preempt_delay_set(struct task_struct *task, unsigned char val)
+{
+ unsigned char *delay_req = task->sched_preempt_delay.delay_req;
+
+ if (delay_req)
+ *delay_req = (val != '0'?1:0);
+}
+#endif
--
1.8.3.2
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
@ 2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29 ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
` (3 subsequent siblings)
4 siblings, 1 reply; 67+ messages in thread
From: Davidlohr Bueso @ 2014-03-03 21:51 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
venki, linux-kernel
On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.
This strikes me more of a feature for a real-time kernel. It is
definitely an interesting concept but wonder about it being abused.
Also, what about just using a voluntary preemption model instead? I'd
think that systems where this is really a problem would opt for that.
> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.
Could you elaborate more on those performance numbers? What
benchmark/workload?
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-03 21:51 ` Davidlohr Bueso
@ 2014-03-03 23:29 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-03 23:29 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
venki, linux-kernel
On 03/03/2014 02:51 PM, Davidlohr Bueso wrote:
> On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
>> I am working on a feature that has been requested by database folks that
>> helps with performance. Some of the oft executed database code uses
>> mutexes to lock other threads out of a critical section. They often see
>> a situation where a thread grabs the mutex, runs out of its timeslice
>> and gets switched out which then causes another thread to run which
>> tries to grab the same mutex, spins for a while and finally gives up.
>
> This strikes me more of a feature for a real-time kernel. It is
> definitely an interesting concept but wonder about it being abused.
> Also, what about just using a voluntary preemption model instead? I'd
> think that systems where this is really a problem would opt for that.
That was my first thought as well when I was asked to implement this
feature :) Designing a system as a real-time system indeed gives the
designer good control over pre-emption but the database folks do not
really want or need a full real-time system espcially since they may
have to run on the same server as other database related services. JVM
certainly can not expect to be run as a realtime process. Database folks
are perfectly happy running with CFS scheduler all the time except
during this kind of critical section. This approach gives them some
control to get extra timeslice when they need it. As for the abuse, it
is no different from a realtime process that can lock up a processor
much worse than this approach. As is the case when using realtime
schedulers, one must use the tools wisely. I have thought about allowing
sysadmins to lock this functionality down some but that does add more
complexity. I am open to doing that if most people feel it is necessary.
>
>> This can happen with multiple threads until original lock owner gets the
>> CPU again and can complete executing its critical section. This queueing
>> and subsequent CPU cycle wastage can be avoided if the locking thread
>> could request to be granted an additional timeslice if its current
>> timeslice runs out before it gives up the lock. Other operating systems
>> have implemented this functionality and is used by databases as well as
>> JVM. This functionality has been shown to improve performance by 3%-5%.
>
> Could you elaborate more on those performance numbers? What
> benchmark/workload?
>
> Thanks,
> Davidlohr
>
This was with tpc-c.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
@ 2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44 ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
` (2 subsequent siblings)
4 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-04 13:56 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/03, Khalid Aziz wrote:
>
> This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock.
Well. I am in no position to discuss the changes in sched/fair.c. I have
to admit that I am skeptical about the whole idea/implementation, but I
leave this to sched/ maintainers.
However, at least the proc/mmap changes do not look right. I didn't read
the whole patch, just picked a "random" ->mmap function, see below.
> kernel/sched/preempt_delay.c | 39 ++++++
Why? This can go into proc/ as well.
> +static void
> +close_preempt_delay_vmops(struct vm_area_struct *vma)
> +{
> + struct preemp_delay_mmap_state *state;
> +
> + state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
> + BUG_ON(!state || !state->task);
> +
> + state->page->mapping = NULL;
> + /* point delay request flag pointer back to old flag in task_struct */
> + state->task->sched_preempt_delay.delay_req =
> + &state->task->sched_preempt_delay.delay_flag;
> + state->task->sched_preempt_delay.mmap_state = NULL;
> + vfree(state->kaddr);
> + kfree(state);
> + vma->vm_private_data = NULL;
> +}
Suppose that state->task != current. Then this can race with do_exit()
which cleanups ->mmap_state too. OTOH do_exit() unmaps this region, it
is not clear why it can't rely in vm_ops->close().
Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
should be NULL ?? Perhaps I misread this patch completely...
> +static int
> +tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + int retval = 0;
> + void *kaddr = NULL;
> + struct preemp_delay_mmap_state *state = NULL;
> + struct inode *inode = file_inode(file);
> + struct task_struct *task;
> + struct page *page;
> +
> + /*
> + * Validate args:
> + * - Only offset 0 support for now
> + * - size should be PAGE_SIZE
> + */
> + if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
> + retval = -EINVAL;
> + goto error;
> + }
> +
> + /*
> + * Only one mmap allowed at a time
> + */
> + if (current->sched_preempt_delay.mmap_state != NULL) {
> + retval = -EEXIST;
> + goto error;
This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
but what if the task opens /proc/random_tid/sched_preempt_delay ?
> + state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
> + kaddr = vmalloc_user(PAGE_SIZE);
Why vmalloc() ? We only need a single page?
> + task = get_proc_task(inode);
And it seems that nobody does put_task_struct(state->task);
> + state->page = page;
> + state->kaddr = kaddr;
> + state->uaddr = (void *)vma->vm_start;
This is used by do_exit(). But ->vm_start can be changed by mremap() ?
Hmm. And mremap() can do vm_ops->close() too. But the new vma will
have the same vm_ops/vm_private_data, so exit_mmap() will try to do
this again... Perhaps I missed something, but I bet this all can't be
right.
> + state->task = task;
> +
> + /* Clear the current delay request flag */
> + task->sched_preempt_delay.delay_flag = 0;
> +
> + /* Point delay request flag pointer to the newly allocated memory */
> + task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
> +
> + task->sched_preempt_delay.mmap_state = state;
> + vma->vm_private_data = state;
> + vma->vm_ops = &preempt_delay_vmops;
> + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
VM_SHARED/VM_WRITE doesn't look right.
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 13:56 ` Oleg Nesterov
@ 2014-03-04 17:44 ` Khalid Aziz
2014-03-04 18:38 ` Al Viro
2014-03-04 19:03 ` Oleg Nesterov
0 siblings, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 17:44 UTC (permalink / raw)
To: Oleg Nesterov
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
Thanks for the review. Please see my comments inline below.
On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
> On 03/03, Khalid Aziz wrote:
>> kernel/sched/preempt_delay.c | 39 ++++++
>
> Why? This can go into proc/ as well.
>
Sure. No strong reason to keep these functions in separate file. These
functions can go into proc/fs/base.c.
>> +static void
>> +close_preempt_delay_vmops(struct vm_area_struct *vma)
>> +{
>> + struct preemp_delay_mmap_state *state;
>> +
>> + state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
>> + BUG_ON(!state || !state->task);
>> +
>> + state->page->mapping = NULL;
>> + /* point delay request flag pointer back to old flag in task_struct */
>> + state->task->sched_preempt_delay.delay_req =
>> + &state->task->sched_preempt_delay.delay_flag;
>> + state->task->sched_preempt_delay.mmap_state = NULL;
>> + vfree(state->kaddr);
>> + kfree(state);
>> + vma->vm_private_data = NULL;
>> +}
>
> Suppose that state->task != current. Then this can race with do_exit()
> which cleanups ->mmap_state too. OTOH do_exit() unmaps this region, it
> is not clear why it can't rely in vm_ops->close().
>
> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
> should be NULL ?? Perhaps I misread this patch completely...
do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and
mmap_state. mmap_state should not be NULL after unmap. vfree() and
kfree() are tolerant of pointers that have already been freed. On the
other hand mmap_state can be NULL in do_exit() if do_exit() and
close_preempt_delay_vmops() were to race since
close_preempt_delay_vmops() sets mmap_state to NULL just before it frees
it up. Could they indeed race, because the thread happens to be killed
just as it had called munmap()? I can protect against that with a
refcount in mmap_state. Do you feel this is necessary/helpful to do?
>
>> +static int
>> +tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> + int retval = 0;
>> + void *kaddr = NULL;
>> + struct preemp_delay_mmap_state *state = NULL;
>> + struct inode *inode = file_inode(file);
>> + struct task_struct *task;
>> + struct page *page;
>> +
>> + /*
>> + * Validate args:
>> + * - Only offset 0 support for now
>> + * - size should be PAGE_SIZE
>> + */
>> + if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
>> + retval = -EINVAL;
>> + goto error;
>> + }
>> +
>> + /*
>> + * Only one mmap allowed at a time
>> + */
>> + if (current->sched_preempt_delay.mmap_state != NULL) {
>> + retval = -EEXIST;
>> + goto error;
>
> This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
> but what if the task opens /proc/random_tid/sched_preempt_delay ?
Good point. A thread should not be allowed to request preemption delay
for another thread. I would recommend leaving this code alone and adding
following code before this:
if (get_proc_task(inode) != current) {
retval = -EPERM;
goto error;
}
Sounds reasonable?
>
>> + state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
>> + kaddr = vmalloc_user(PAGE_SIZE);
>
> Why vmalloc() ? We only need a single page?
>
Makes sense. I will switch to get_zeroed_page().
>> + task = get_proc_task(inode);
>
> And it seems that nobody does put_task_struct(state->task);
Good catch. I had caught the other two instances of get_proc_task() but
missed this one.
>
>> + state->page = page;
>> + state->kaddr = kaddr;
>> + state->uaddr = (void *)vma->vm_start;
>
> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>
>
> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
> this again... Perhaps I missed something, but I bet this all can't be
> right.
Would you say sys_munmap() of mmap_state->uaddr is not even needed since
exit_mm() will do this any way further down in do_exit()? If I were to
remove this sys_munmap(), that could simplify the race issues as well.
>
>> + state->task = task;
>> +
>> + /* Clear the current delay request flag */
>> + task->sched_preempt_delay.delay_flag = 0;
>> +
>> + /* Point delay request flag pointer to the newly allocated memory */
>> + task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
>> +
>> + task->sched_preempt_delay.mmap_state = state;
>> + vma->vm_private_data = state;
>> + vma->vm_ops = &preempt_delay_vmops;
>> + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>
> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
Yes, you are right. I will add that.
> VM_SHARED/VM_WRITE doesn't look right.
VM_SHARED is wrong but VM_WRITE is needed I think since the thread will
write to the mmap'd page to signal to request preemption delay.
>
> Oleg.
>
I appreciate your taking the time to review this code. Thank you very much.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 17:44 ` Khalid Aziz
@ 2014-03-04 18:38 ` Al Viro
2014-03-04 19:01 ` Khalid Aziz
2014-03-04 19:03 ` Oleg Nesterov
1 sibling, 1 reply; 67+ messages in thread
From: Al Viro @ 2014-03-04 18:38 UTC (permalink / raw)
To: Khalid Aziz
Cc: Oleg Nesterov, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
venki, linux-kernel
On Tue, Mar 04, 2014 at 10:44:54AM -0700, Khalid Aziz wrote:
> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr
> and mmap_state. mmap_state should not be NULL after unmap. vfree()
> and kfree() are tolerant of pointers that have already been freed.
Huh? Double free() is a bug, plain and simple. Never do that - not
in userland and especially not in the kernel. Think what happens if
some code gets executed between those two and asks to allocate something.
If it gets the area you'd just freed, your second free will leave it
with all kinds of nasty surprises. Starting with "who the hell has
started to modify the object I'd allocated and hadn't freed?"
A: p = alloc();
A: free(p);
B: q = alloc(); /* q == p now */
B: *q = 0; /* *q is zero */
A: free(p); /* same as free(q) */
C: r = alloc(); /* r == q now */
C: *r = 1; /* *q is one */
B: if (*q != 0) panic("somebody's buggering my memory");
It's always a bug, whether the implementation catches it or not.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 18:38 ` Al Viro
@ 2014-03-04 19:01 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 19:01 UTC (permalink / raw)
To: Al Viro
Cc: Oleg Nesterov, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
venki, linux-kernel
On Tue, 2014-03-04 at 18:38 +0000, Al Viro wrote:
> On Tue, Mar 04, 2014 at 10:44:54AM -0700, Khalid Aziz wrote:
>
> > do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr
> > and mmap_state. mmap_state should not be NULL after unmap. vfree()
> > and kfree() are tolerant of pointers that have already been freed.
>
> Huh? Double free() is a bug, plain and simple. Never do that - not
> in userland and especially not in the kernel. Think what happens if
> some code gets executed between those two and asks to allocate something.
> If it gets the area you'd just freed, your second free will leave it
> with all kinds of nasty surprises. Starting with "who the hell has
> started to modify the object I'd allocated and hadn't freed?"
>
> A: p = alloc();
> A: free(p);
> B: q = alloc(); /* q == p now */
> B: *q = 0; /* *q is zero */
> A: free(p); /* same as free(q) */
> C: r = alloc(); /* r == q now */
> C: *r = 1; /* *q is one */
> B: if (*q != 0) panic("somebody's buggering my memory");
>
> It's always a bug, whether the implementation catches it or not.
Agreed, you are right. I will fix it.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 17:44 ` Khalid Aziz
2014-03-04 18:38 ` Al Viro
@ 2014-03-04 19:03 ` Oleg Nesterov
2014-03-04 20:14 ` Khalid Aziz
1 sibling, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-04 19:03 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/04, Khalid Aziz wrote:
>
> On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
>> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
>> should be NULL ?? Perhaps I misread this patch completely...
>
> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and
> mmap_state. mmap_state should not be NULL after unmap.
Can't understand... do_exit() does:
+#if CONFIG_SCHED_PREEMPT_DELAY
+ if (tsk->sched_preempt_delay.mmap_state) {
+ sys_munmap((unsigned long)
+ tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
+ vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
+ kfree(tsk->sched_preempt_delay.mmap_state);
sys_munmap() (which btw should not be used) obviously unmaps that
vma and vma_ops()->close() should be called.
close_preempt_delay_vmops() does:
state->task->sched_preempt_delay.mmap_state = NULL;
vfree(tsk->sched_preempt_delay.mmap_state->kaddr) above will try to
dereference .mmap_state == NULL.
IOW, I think that with this patch this trivial program
int main(void)
{
fd = open("/proc/self/task/$TID/sched_preempt_delay", O_RDWR);
mmap(NULL, 4096, PROT_READ,MAP_SHARED, fd, 0);
return 0;
}
should crash the kernel.
>>> + if (current->sched_preempt_delay.mmap_state != NULL) {
>>> + retval = -EEXIST;
>>> + goto error;
>>
>> This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
>> but what if the task opens /proc/random_tid/sched_preempt_delay ?
>
> Good point. A thread should not be allowed to request preemption delay
> for another thread. I would recommend leaving this code alone and adding
> following code before this:
>
> if (get_proc_task(inode) != current) {
> retval = -EPERM;
> goto error;
> }
>
> Sounds reasonable?
Yes, we should check == current, but this interface looks strange anyway, imho.
>>> + state->page = page;
>>> + state->kaddr = kaddr;
>>> + state->uaddr = (void *)vma->vm_start;
>>
>> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>>
>>
>> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
>> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
>> this again... Perhaps I missed something, but I bet this all can't be
>> right.
>
> Would you say sys_munmap() of mmap_state->uaddr is not even needed since
> exit_mm() will do this any way further down in do_exit()?
No.
I meant:
1. mremap() can move this vma, so do_exit() can't trust ->uaddr
2. Even worse, mremap() itself is not safe. It can do ->close()
too and create the new vma with the same vm_ops. Another
unmap from (say) exit_mm() won't be happy.
>>> + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>>
>> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
>
> Yes, you are right. I will add that.
>
>> VM_SHARED/VM_WRITE doesn't look right.
>
> VM_SHARED is wrong but VM_WRITE is needed I think since the thread will
> write to the mmap'd page to signal to request preemption delay.
But ->mmap() should not set VM_WRITE if application does mmap(PROT_READ) ?
VM_WRITE-or-not should be decided by do_mmap_pgoff/mprotect, ->mmap()
should not play with this bit.
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 19:03 ` Oleg Nesterov
@ 2014-03-04 20:14 ` Khalid Aziz
2014-03-05 14:38 ` Oleg Nesterov
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 20:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/04/2014 12:03 PM, Oleg Nesterov wrote:
> On 03/04, Khalid Aziz wrote:
>>
>> On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
>>> Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
>>> should be NULL ?? Perhaps I misread this patch completely...
>>
>> do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and
>> mmap_state. mmap_state should not be NULL after unmap.
>
> Can't understand... do_exit() does:
>
> +#if CONFIG_SCHED_PREEMPT_DELAY
> + if (tsk->sched_preempt_delay.mmap_state) {
> + sys_munmap((unsigned long)
> + tsk->sched_preempt_delay.mmap_state->uaddr, PAGE_SIZE);
> + vfree(tsk->sched_preempt_delay.mmap_state->kaddr);
> + kfree(tsk->sched_preempt_delay.mmap_state);
>
> sys_munmap() (which btw should not be used) obviously unmaps that
> vma and vma_ops()->close() should be called.
>
> close_preempt_delay_vmops() does:
>
> state->task->sched_preempt_delay.mmap_state = NULL;
>
> vfree(tsk->sched_preempt_delay.mmap_state->kaddr) above will try to
> dereference .mmap_state == NULL.
>
> IOW, I think that with this patch this trivial program
>
> int main(void)
> {
> fd = open("/proc/self/task/$TID/sched_preempt_delay", O_RDWR);
> mmap(NULL, 4096, PROT_READ,MAP_SHARED, fd, 0);
> return 0;
> }
>
> should crash the kernel.
>
>>>> + state->page = page;
>>>> + state->kaddr = kaddr;
>>>> + state->uaddr = (void *)vma->vm_start;
>>>
>>> This is used by do_exit(). But ->vm_start can be changed by mremap() ?
>>>
>>>
>>> Hmm. And mremap() can do vm_ops->close() too. But the new vma will
>>> have the same vm_ops/vm_private_data, so exit_mmap() will try to do
>>> this again... Perhaps I missed something, but I bet this all can't be
>>> right.
>>
>> Would you say sys_munmap() of mmap_state->uaddr is not even needed since
>> exit_mm() will do this any way further down in do_exit()?
>
> No.
>
> I meant:
>
> 1. mremap() can move this vma, so do_exit() can't trust ->uaddr
>
> 2. Even worse, mremap() itself is not safe. It can do ->close()
> too and create the new vma with the same vm_ops. Another
> unmap from (say) exit_mm() won't be happy.
I agree this looks like a potential spot for trouble. I was asking if
removing sys_munmap() of uaddr from do_exit() solves both of the above
problems? You have convinced me this sys_munmap() I added is unnecessary.
>
>>>> + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;
>>>
>>> This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).
>>
>> Yes, you are right. I will add that.
>>
>>> VM_SHARED/VM_WRITE doesn't look right.
>>
>> VM_SHARED is wrong but VM_WRITE is needed I think since the thread will
>> write to the mmap'd page to signal to request preemption delay.
>
> But ->mmap() should not set VM_WRITE if application does mmap(PROT_READ) ?
> VM_WRITE-or-not should be decided by do_mmap_pgoff/mprotect, ->mmap()
> should not play with this bit.
>
Ah, I see. This makes sense. I will remove it.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-04 13:56 ` Oleg Nesterov
@ 2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39 ` Khalid Aziz
2014-03-06 13:24 ` Rasmus Villemoes
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
4 siblings, 2 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-04 21:12 UTC (permalink / raw)
To: Khalid Aziz, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
viro, oleg, venki
Cc: linux-kernel
On 03/03/2014 10:07 AM, Khalid Aziz wrote:
>
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.
> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.
>
> I have implemented similar functionality for Linux. This patch adds a
> file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
> Writing 1 to this file causes CFS scheduler to grant additional time
> slice if the currently running process comes up for pre-emption. Writing
> to this file needs to be very quick operation, so I have implemented
> code to allow mmap'ing /proc/<tgid>/task/<tid>/sched_preempt_delay. This
> allows a userspace task to write this flag very quickly. Usage model is
> a thread mmaps this file during initialization. It then writes a 1 to
> the mmap'd file after it grabs the lock in its critical section where it
> wants immunity from pre-emption. It then writes 0 again to this file
> after it releases the lock and calls sched_yield() to give up the
> processor. I have also added a new field in scheduler statistics -
> nr_preempt_delayed, that counts the number of times a thread has been
> granted amnesty. Further details on using this functionality are in
> Documentation/scheduler/sched-preempt-delay.txt in the patch. This
> patch is based upon the work Venkatesh Pallipadi had done couple of
> years ago.
>
Shades of the Android wakelocks, no?
This seems to effectively give userspace an option to turn preemptive
multitasking into cooperative multitasking, which of course is
unacceptable for a privileged process (the same reason why unprivileged
processes aren't allowed to run at above-normal priority, including RT
priority.)
I have several issues with this interface:
1. First, a process needs to know if it *should* have been preempted
before it calls sched_yield(). So there needs to be a second flag set
by the scheduler when granting amnesty.
2. A process which fails to call sched_yield() after being granted
amnesty must be penalized.
3. I'm not keen on occupying a full page for this. I'm wondering if
doing a pointer into user space, futex-style, might make more sense.
The downside, of course, is what happens if the page being pointed to is
swapped out.
Keep in mind this HAS to be per thread.
-hpa
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 21:12 ` H. Peter Anvin
@ 2014-03-04 21:39 ` Khalid Aziz
2014-03-04 22:23 ` One Thousand Gnomes
2014-03-06 13:24 ` Rasmus Villemoes
1 sibling, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 21:39 UTC (permalink / raw)
To: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
viro, oleg
Cc: linux-kernel
On 03/04/2014 02:12 PM, H. Peter Anvin wrote:
>
> Shades of the Android wakelocks, no?
>
> This seems to effectively give userspace an option to turn preemptive
> multitasking into cooperative multitasking, which of course is
> unacceptable for a privileged process (the same reason why unprivileged
> processes aren't allowed to run at above-normal priority, including RT
> priority.)
>
> I have several issues with this interface:
>
> 1. First, a process needs to know if it *should* have been preempted
> before it calls sched_yield(). So there needs to be a second flag set
> by the scheduler when granting amnesty.
Good idea. I like it. I will add it.
>
> 2. A process which fails to call sched_yield() after being granted
> amnesty must be penalized.
I agree. Is it fair to say that such a process sees the penalty by being
charged that extra timeslice and being pushed to the right side of RB
tree since its p->se.vruntime would have gone up, which then delays the
time when it can get CPU again? I am open to adding a more explicit
penalty - maybe deny its next preemption delay request if it failed to
call sched_yield() the last time when it should have?
>
> 3. I'm not keen on occupying a full page for this. I'm wondering if
> doing a pointer into user space, futex-style, might make more sense.
> The downside, of course, is what happens if the page being pointed to is
> swapped out.
Using a full page for what is effectively a single bit flag does not sit
well with me either. Doing it through proc forces minimum size of a page
(please correct me there if I am wrong). I will explore your idea some
more to see if that can be made to work.
>
> Keep in mind this HAS to be per thread.
>
Thanks, hpa!
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 21:39 ` Khalid Aziz
@ 2014-03-04 22:23 ` One Thousand Gnomes
2014-03-04 22:44 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: One Thousand Gnomes @ 2014-03-04 22:23 UTC (permalink / raw)
To: Khalid Aziz
Cc: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
viro, oleg, linux-kernel
Obvious bug
| Usage model is a thread mmaps this file during initialization. It then
| writes a 1 to the mmap'd file after it grabs the lock in its critical
| section where it wants immunity from pre-emption.
You need to write it first or you can be pre-empted taking the lock
before asking for immunity.
Presumably you could equally use something to group tasks (say a control
group of some form) and implement voluntary pre-emption within the group
only when in user space. Ie they only pre-empt each other by yielding but
they can be pre-empted by other tasks outside the group ?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 22:23 ` One Thousand Gnomes
@ 2014-03-04 22:44 ` Khalid Aziz
2014-03-05 0:39 ` Thomas Gleixner
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-04 22:44 UTC (permalink / raw)
To: One Thousand Gnomes
Cc: H. Peter Anvin, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
viro, oleg, linux-kernel
On 03/04/2014 03:23 PM, One Thousand Gnomes wrote:
> Obvious bug
>
> | Usage model is a thread mmaps this file during initialization. It then
> | writes a 1 to the mmap'd file after it grabs the lock in its critical
> | section where it wants immunity from pre-emption.
>
> You need to write it first or you can be pre-empted taking the lock
> before asking for immunity.
Ah, yes. Thanks, Alan!
>
> Presumably you could equally use something to group tasks (say a control
> group of some form) and implement voluntary pre-emption within the group
> only when in user space. Ie they only pre-empt each other by yielding but
> they can be pre-empted by other tasks outside the group ?
>
I had suggested that to database folks initially, but they use mutexes
that are shared across processes and they do not believe they can
control the environment in customer scenario where they can ensure right
tasks will always be in the right control group. Besides they want to
use a common mechanism across multiple OSs and pre-emption delay is
already in use on other OSs. Good idea though.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 22:44 ` Khalid Aziz
@ 2014-03-05 0:39 ` Thomas Gleixner
2014-03-05 0:51 ` Andi Kleen
0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-05 0:39 UTC (permalink / raw)
To: Khalid Aziz
Cc: One Thousand Gnomes, H. Peter Anvin, Ingo Molnar, peterz, akpm,
andi.kleen, rob, viro, oleg, linux-kernel
On Tue, 4 Mar 2014, Khalid Aziz wrote:
> be in the right control group. Besides they want to use a common mechanism
> across multiple OSs and pre-emption delay is already in use on other OSs. Good
> idea though.
Well, just because preemption delay is a mechanism exposed by some
other OS does not make it a good idea.
In fact its a horrible idea.
What you are creating is a crystal ball based form of time bound
priority ceiling with the worst user space interface i've ever seen.
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 0:39 ` Thomas Gleixner
@ 2014-03-05 0:51 ` Andi Kleen
2014-03-05 11:10 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-05 0:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Khalid Aziz, One Thousand Gnomes, H. Peter Anvin, Ingo Molnar,
peterz, akpm, viro, oleg, linux-kernel
Thomas Gleixner <tglx@linutronix.de> writes:
> On Tue, 4 Mar 2014, Khalid Aziz wrote:
>> be in the right control group. Besides they want to use a common mechanism
>> across multiple OSs and pre-emption delay is already in use on other OSs. Good
>> idea though.
>
> Well, just because preemption delay is a mechanism exposed by some
> other OS does not make it a good idea.
>
> In fact its a horrible idea.
>
> What you are creating is a crystal ball based form of time bound
> priority ceiling with the worst user space interface i've ever seen.
So how would you solve the user space lock holder preemption
problem then?
It's a real problem, affecting lots of workloads.
Just saying everything is crap without suggesting anything
constructive is not really getting us anywhere.
The workarounds I've seen for it are generally far worse
than this. Often people do all kinds of fragile tunings
to address this, when then break on the next kernel
update that does even a minor scheduler change.
futex doesn't solve the problem at all.
The real time scheduler is also really poor fit for these
workloads and needs a lot of hacks to scale.
The thread swap proposal from plumbers had some potential,
but it's likely very intrusive everywhere and seems
to have died too.
Anything else?
-Andi
--
ak@linux.intel.com -- Speaking for myself only
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 0:51 ` Andi Kleen
@ 2014-03-05 11:10 ` Peter Zijlstra
2014-03-05 17:29 ` Khalid Aziz
2014-03-05 19:58 ` Khalid Aziz
2014-03-05 14:54 ` Oleg Nesterov
2014-03-06 12:13 ` Kevin Easton
2 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-05 11:10 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, akpm, viro, oleg, linux-kernel
On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> Anything else?
Proxy execution; its a form of PI that works for arbitrary scheduling
policies (thus also very much including fair).
With that what you effectively end up with is the lock holder running
'boosted' by the runtime of its blocked chain until the entire chain
runs out of time, at which point preemption doesn't matter anyhow.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 20:14 ` Khalid Aziz
@ 2014-03-05 14:38 ` Oleg Nesterov
2014-03-05 16:12 ` Oleg Nesterov
0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 14:38 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/04, Khalid Aziz wrote:
>
> On 03/04/2014 12:03 PM, Oleg Nesterov wrote:
>>
>> 1. mremap() can move this vma, so do_exit() can't trust ->uaddr
>>
>> 2. Even worse, mremap() itself is not safe. It can do ->close()
>> too and create the new vma with the same vm_ops. Another
>> unmap from (say) exit_mm() won't be happy.
>
> I agree this looks like a potential spot for trouble. I was asking if
> removing sys_munmap() of uaddr from do_exit() solves both of the above
> problems?
How? Of course this won't solve the problems. And there are more problems.
> You have convinced me this sys_munmap() I added is unnecessary.
Cough ;) I didn't try to convince you that it should be removed. It is
necessary (but of course you should not use sys_munmap(), say vm_munmap
is better.
But you know, I think this all doesn't matter. Imho, this proc/mmap
interface is horrible. Perhaps something like tls area visible to kernel
make sense, but it should be more generic at least.
You added /proc/sched_preempt_delay to avoid the syscall. I think it
would be better to simply add vdso_sched_preempt_delay() instead.
But the main problem, of course, is that this feature is questionable.
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 0:51 ` Andi Kleen
2014-03-05 11:10 ` Peter Zijlstra
@ 2014-03-05 14:54 ` Oleg Nesterov
2014-03-05 15:56 ` Andi Kleen
2014-03-06 12:13 ` Kevin Easton
2 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 14:54 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/04, Andi Kleen wrote:
>
> Anything else?
Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
Or perhaps sys_futex() can do this if it knows the owner. Don't ask
me what exactly I mean though ;)
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 14:54 ` Oleg Nesterov
@ 2014-03-05 15:56 ` Andi Kleen
2014-03-05 16:36 ` Oleg Nesterov
0 siblings, 1 reply; 67+ messages in thread
From: Andi Kleen @ 2014-03-05 15:56 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
> On 03/04, Andi Kleen wrote:
> >
> > Anything else?
>
> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
> me what exactly I mean though ;)
You mean yield_to() would extend the time slice?
That would be the same as the mmap page, just with a syscall right?
-andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 14:38 ` Oleg Nesterov
@ 2014-03-05 16:12 ` Oleg Nesterov
2014-03-05 17:10 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 16:12 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/05, Oleg Nesterov wrote:
>
> You added /proc/sched_preempt_delay to avoid the syscall. I think it
> would be better to simply add vdso_sched_preempt_delay() instead.
I am stupid. vdso_sched_preempt_delay() obviously can't write to, say,
task_struct.
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 15:56 ` Andi Kleen
@ 2014-03-05 16:36 ` Oleg Nesterov
2014-03-05 17:22 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-05 16:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05, Andi Kleen wrote:
>
> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
> > On 03/04, Andi Kleen wrote:
> > >
> > > Anything else?
> >
> > Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
> > Or perhaps sys_futex() can do this if it knows the owner. Don't ask
> > me what exactly I mean though ;)
>
> You mean yield_to() would extend the time slice?
>
> That would be the same as the mmap page, just with a syscall right?
Not the same. Very roughly I meant something like
my_lock()
{
if (!TRY_LOCK()) {
yield_to(owner);
LOCK();
}
owner = gettid();
}
But once again, I am not sure if this makes any sense.
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 16:12 ` Oleg Nesterov
@ 2014-03-05 17:10 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:10 UTC (permalink / raw)
To: Oleg Nesterov
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, venki,
linux-kernel
On 03/05/2014 09:12 AM, Oleg Nesterov wrote:
> On 03/05, Oleg Nesterov wrote:
>>
>> You added /proc/sched_preempt_delay to avoid the syscall. I think it
>> would be better to simply add vdso_sched_preempt_delay() instead.
>
> I am stupid. vdso_sched_preempt_delay() obviously can't write to, say,
> task_struct.
>
> Oleg.
>
That is easy to miss :) I had looked into vdso when I started but found
out I can't use vdso for this.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 16:36 ` Oleg Nesterov
@ 2014-03-05 17:22 ` Khalid Aziz
2014-03-05 23:13 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:22 UTC (permalink / raw)
To: Oleg Nesterov, Andi Kleen
Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 09:36 AM, Oleg Nesterov wrote:
> On 03/05, Andi Kleen wrote:
>>
>> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
>>> On 03/04, Andi Kleen wrote:
>>>>
>>>> Anything else?
>>>
>>> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
>>> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
>>> me what exactly I mean though ;)
>>
>> You mean yield_to() would extend the time slice?
>>
>> That would be the same as the mmap page, just with a syscall right?
>
> Not the same. Very roughly I meant something like
>
> my_lock()
> {
> if (!TRY_LOCK()) {
> yield_to(owner);
> LOCK();
> }
>
> owner = gettid();
> }
>
> But once again, I am not sure if this makes any sense.
>
> Oleg.
>
Trouble with that approach is by the time a thread finds out it can not
acquire the lock because someone else has it, we have already paid the
price of context switch. What I am trying to do is to avoid that cost. I
looked into a few other approaches to solving this problem without
making kernel changes:
- Use PTHREAD_PRIO_PROTECT protocol to boost the priority of thread that
holds the lock to minimize contention and CPU cycles wasted by other
threads only to find out someone already has the lock. Problem I ran
into is the implementation of PTHREAD_PRIO_PROTECT requires another
system call, sched_setscheduler(), inside the library to boost priority.
Now I have added the overhead of a new system call which easily
outweighs any performance gains from removing lock contention. Besides
databases implement their own spinlocks to maximize performance and thus
can not use PTHREAD_PRIO_PROTECT in posix threads library.
- I looked into adaptive spinning futex work Darren Hart was working on.
It looked very promising but I ran into the same problem again. It
reduces the cost of contention by delaying context switches in cases
where spinning is quicker but it still does not do anything to reduce
the cost of context switch for a thread to get the CPU only to find out
it can not get the lock. This cost again outweighs the 3%-5% benefit we
are seeing from just not giving up CPU in the middle of critical section.
Makes sense?
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 11:10 ` Peter Zijlstra
@ 2014-03-05 17:29 ` Khalid Aziz
2014-03-05 19:58 ` Khalid Aziz
1 sibling, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 17:29 UTC (permalink / raw)
To: Peter Zijlstra, Andi Kleen
Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, akpm, viro, oleg, linux-kernel
On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> Proxy execution; its a form of PI that works for arbitrary scheduling
> policies (thus also very much including fair).
>
> With that what you effectively end up with is the lock holder running
> 'boosted' by the runtime of its blocked chain until the entire chain
> runs out of time, at which point preemption doesn't matter anyhow.
>
I assume you are referring to
http://users.soe.ucsc.edu/~jayhawk/watkins-rtlws09.pdf and/or
http://www.osadl.org/fileadmin/dam/presentations/RTLWS11/peterz-edf-why-not.pdf
? I will read through these and see if this can solve the problem.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 11:10 ` Peter Zijlstra
2014-03-05 17:29 ` Khalid Aziz
@ 2014-03-05 19:58 ` Khalid Aziz
2014-03-06 9:57 ` Peter Zijlstra
2014-03-06 11:14 ` Thomas Gleixner
1 sibling, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 19:58 UTC (permalink / raw)
To: Peter Zijlstra, Andi Kleen
Cc: Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, akpm, viro, oleg, linux-kernel
On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> Proxy execution; its a form of PI that works for arbitrary scheduling
> policies (thus also very much including fair).
>
> With that what you effectively end up with is the lock holder running
> 'boosted' by the runtime of its blocked chain until the entire chain
> runs out of time, at which point preemption doesn't matter anyhow.
>
Hello Peter,
I read through the concept of proxy execution and it is a very
interesting concept. I come from many years of realtime and embeddded
systems development and I can easily recall various problems in the past
that can be solved or helped by this. Looking at the current problem I
am trying to solve with databases and JVM, I run into the same issue I
described in my earlier email. Proxy execution is a post-contention
solution. By the time proxy execution can do something for my case, I
have already paid the price of contention and a context switch which is
what I am trying to avoid. For a critical section that is very short
compared to the size of execution thread, which is the case I am looking
at, avoiding preemption in the middle of that short critical section
helps much more than dealing with lock contention later on. The goal
here is to avoid lock contention and associated cost. I do understand
the cost of dealing with lock contention poorly and that can easily be
much bigger cost, but I am looking into avoiding even getting there.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 17:22 ` Khalid Aziz
@ 2014-03-05 23:13 ` David Lang
2014-03-05 23:48 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-05 23:13 UTC (permalink / raw)
To: Khalid Aziz
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 09:36 AM, Oleg Nesterov wrote:
>> On 03/05, Andi Kleen wrote:
>>>
>>> On Wed, Mar 05, 2014 at 03:54:20PM +0100, Oleg Nesterov wrote:
>>>> On 03/04, Andi Kleen wrote:
>>>>>
>>>>> Anything else?
>>>>
>>>> Well, we have yield_to(). Perhaps sys_yield_to(lock_owner) can help.
>>>> Or perhaps sys_futex() can do this if it knows the owner. Don't ask
>>>> me what exactly I mean though ;)
>>>
>>> You mean yield_to() would extend the time slice?
>>>
>>> That would be the same as the mmap page, just with a syscall right?
>>
>> Not the same. Very roughly I meant something like
>>
>> my_lock()
>> {
>> if (!TRY_LOCK()) {
>> yield_to(owner);
>> LOCK();
>> }
>>
>> owner = gettid();
>> }
>>
>> But once again, I am not sure if this makes any sense.
>>
>> Oleg.
>>
>
> Trouble with that approach is by the time a thread finds out it can not
> acquire the lock because someone else has it, we have already paid the price
> of context switch. What I am trying to do is to avoid that cost. I looked
> into a few other approaches to solving this problem without making kernel
> changes:
Yes, you have paid the cost of the context switch, but your original problem
description talked about having multiple other threads trying to get the lock,
then spinning trying to get the lock (wasting time if the process holding it is
asleep, but not if it's running on another core) and causing a long delay before
the process holding the lock gets a chance to run again.
Having the threads immediately yield to the process that has the lock reduces
this down to two context switches, which isn't perfect, but it's a LOT better
than what you started from.
> - Use PTHREAD_PRIO_PROTECT protocol to boost the priority of thread that
> holds the lock to minimize contention and CPU cycles wasted by other threads
> only to find out someone already has the lock. Problem I ran into is the
> implementation of PTHREAD_PRIO_PROTECT requires another system call,
> sched_setscheduler(), inside the library to boost priority. Now I have added
> the overhead of a new system call which easily outweighs any performance
> gains from removing lock contention. Besides databases implement their own
> spinlocks to maximize performance and thus can not use PTHREAD_PRIO_PROTECT
> in posix threads library.
well, writing to something in /proc isn't free either. And how is the thread
supposed to know if it needs to do so or if it's going to have enough time to
finish it's work before it's out of time (how can it know how much time it would
have left anyway?)
> - I looked into adaptive spinning futex work Darren Hart was working on. It
> looked very promising but I ran into the same problem again. It reduces the
> cost of contention by delaying context switches in cases where spinning is
> quicker but it still does not do anything to reduce the cost of context
> switch for a thread to get the CPU only to find out it can not get the lock.
> This cost again outweighs the 3%-5% benefit we are seeing from just not
> giving up CPU in the middle of critical section.
is this gain from not giving up the CPU at all? or is it from avoiding all the
delays due to the contending thread trying in turn? the yield_to() approach
avoids all those other threads trying in turn so it should get fairly close to
the same benefits.
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 23:13 ` David Lang
@ 2014-03-05 23:48 ` Khalid Aziz
2014-03-05 23:56 ` H. Peter Anvin
2014-03-05 23:59 ` David Lang
0 siblings, 2 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-05 23:48 UTC (permalink / raw)
To: David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 04:13 PM, David Lang wrote:
> Yes, you have paid the cost of the context switch, but your original
> problem description talked about having multiple other threads trying to
> get the lock, then spinning trying to get the lock (wasting time if the
> process holding it is asleep, but not if it's running on another core)
> and causing a long delay before the process holding the lock gets a
> chance to run again.
>
> Having the threads immediately yield to the process that has the lock
> reduces this down to two context switches, which isn't perfect, but it's
> a LOT better than what you started from.
OK, let us consider the multiple core scenario:
- Thread A gets scheduled on core 1, runs for 95% of its timeslice
before it gets to its critical section.
- Thread A grabs the lock and quickly reaches the end of its timeslice
before finishing its critical section.
- Thread A is preempted on core 1 by a completely unrelated thread.
- Thread B in the mean time is scheduled on core 2 and happens to get to
its critical section right away where it tries to grab the lock held by
thread A. It spins for a bit waiting to see if lock becomes available,
gives up and yields to next process in queue.
- Since thread A ran recently, it is now stuck towards the end of run
queue, so thread C gets to run on core 2 which goes through same fate as
thread A.
Now scale this scenario across more cores and more threads that all want
the same lock to execute their small critical section. This cost of
spinning and context switch could have been avoided if thread A could
get additional timeslice to complete its critical section. Yielding to
the process holding the lock happens only after contention has happened
and we have paid the price of two context switches for the lock owner.
When yield_to() happens, the lock owner may not get to run on the core
it was on because an unrelated thread is running on that core and it
needs to wait for that thread's timeslice to run out. If the lock owner
gets scheduled on another core, we pay the price of repopulating the
cache for a new thread on that core. yield_to() is better than having a
convoy building up of processes waiting for the same lock but worse than
avoiding the contention altogether.
>
> well, writing to something in /proc isn't free either. And how is the
> thread supposed to know if it needs to do so or if it's going to have
> enough time to finish it's work before it's out of time (how can it know
> how much time it would have left anyway?)
Cost is writing to a memory location since thread is using mmap, not
insignificant but hardly expensive. Thread does not need to know how
much time it has left in current timeslice. It always sets the flag to
request pre-emption immunity before entering the critical section and
clears the flag when it exits its critical section. If the thread comes
up for pre-emption while the flag is set, it gets immunity. If it does
not, flag will be cleared at the end of critical section any way.
> is this gain from not giving up the CPU at all? or is it from avoiding
> all the delays due to the contending thread trying in turn? the
> yield_to() approach avoids all those other threads trying in turn so it
> should get fairly close to the same benefits.
>
The gain is from avoiding contention by giving locking thread a chance
to complete its critical section which is expected to be very short
(certainly shorter than timeslice). Pre-emption immunity gives it one
and only one additional timeslice.
Hope this helps clear things up.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 23:48 ` Khalid Aziz
@ 2014-03-05 23:56 ` H. Peter Anvin
2014-03-06 0:02 ` Khalid Aziz
2014-03-05 23:59 ` David Lang
1 sibling, 1 reply; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-05 23:56 UTC (permalink / raw)
To: Khalid Aziz, David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 03:48 PM, Khalid Aziz wrote:
>
> Cost is writing to a memory location since thread is using mmap, not
> insignificant but hardly expensive. Thread does not need to know how
> much time it has left in current timeslice. It always sets the flag to
> request pre-emption immunity before entering the critical section and
> clears the flag when it exits its critical section. If the thread comes
> up for pre-emption while the flag is set, it gets immunity. If it does
> not, flag will be cleared at the end of critical section any way.
>
A little more than that. The scheduler needs to set *another* flag
telling the process to yield upon leaving the critical section; if the
process doesn't, the scheduler needs to keep enough accounting to know
to penalize the process, or this method will not be usable for
unprivileged processes.
I have been thinking about how to deal with the fact that if this isn't
locked in memory, what happens if it is paged out. If that can be
detected at reasonable cost, it probably makes most sense to simply say
that if the pointed-to memory is paged out, we're not in a critical section.
-hpa
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 23:48 ` Khalid Aziz
2014-03-05 23:56 ` H. Peter Anvin
@ 2014-03-05 23:59 ` David Lang
2014-03-06 0:17 ` Khalid Aziz
1 sibling, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-05 23:59 UTC (permalink / raw)
To: Khalid Aziz
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 04:13 PM, David Lang wrote:
>> Yes, you have paid the cost of the context switch, but your original
>> problem description talked about having multiple other threads trying to
>> get the lock, then spinning trying to get the lock (wasting time if the
>> process holding it is asleep, but not if it's running on another core)
>> and causing a long delay before the process holding the lock gets a
>> chance to run again.
>>
>> Having the threads immediately yield to the process that has the lock
>> reduces this down to two context switches, which isn't perfect, but it's
>> a LOT better than what you started from.
>
> OK, let us consider the multiple core scenario:
>
> - Thread A gets scheduled on core 1, runs for 95% of its timeslice before it
> gets to its critical section.
> - Thread A grabs the lock and quickly reaches the end of its timeslice before
> finishing its critical section.
> - Thread A is preempted on core 1 by a completely unrelated thread.
> - Thread B in the mean time is scheduled on core 2 and happens to get to its
> critical section right away where it tries to grab the lock held by thread A.
> It spins for a bit waiting to see if lock becomes available, gives up and
> yields to next process in queue.
> - Since thread A ran recently, it is now stuck towards the end of run queue,
> so thread C gets to run on core 2 which goes through same fate as thread A.
>
> Now scale this scenario across more cores and more threads that all want the
> same lock to execute their small critical section. This cost of spinning and
> context switch could have been avoided if thread A could get additional
> timeslice to complete its critical section. Yielding to the process holding
> the lock happens only after contention has happened and we have paid the
> price of two context switches for the lock owner. When yield_to() happens,
> the lock owner may not get to run on the core it was on because an unrelated
> thread is running on that core and it needs to wait for that thread's
> timeslice to run out. If the lock owner gets scheduled on another core, we
> pay the price of repopulating the cache for a new thread on that core.
> yield_to() is better than having a convoy building up of processes waiting
> for the same lock but worse than avoiding the contention altogether.
with yield_to(), when thread B hit the contention, thread A would get more time
an be able to complete immediatly, so by the time thread C gets around to
running, there is no longer any contention.
>>
>> well, writing to something in /proc isn't free either. And how is the
>> thread supposed to know if it needs to do so or if it's going to have
>> enough time to finish it's work before it's out of time (how can it know
>> how much time it would have left anyway?)
>
> Cost is writing to a memory location since thread is using mmap, not
> insignificant but hardly expensive. Thread does not need to know how much
> time it has left in current timeslice. It always sets the flag to request
> pre-emption immunity before entering the critical section and clears the flag
> when it exits its critical section. If the thread comes up for pre-emption
> while the flag is set, it gets immunity. If it does not, flag will be cleared
> at the end of critical section any way.
what's the cost to setup mmap of this file in /proc. this is sounding like a lot
of work.
>> is this gain from not giving up the CPU at all? or is it from avoiding
>> all the delays due to the contending thread trying in turn? the
>> yield_to() approach avoids all those other threads trying in turn so it
>> should get fairly close to the same benefits.
>>
>
> The gain is from avoiding contention by giving locking thread a chance to
> complete its critical section which is expected to be very short (certainly
> shorter than timeslice). Pre-emption immunity gives it one and only one
> additional timeslice.
but the yield_to() does almost the same thing, there is a small bump, but you
don't have to wait for thread B to spin, thread C..ZZZ etc to spin before thread
A can finish it's work. As soon as the second thread hits the critical section,
thread A is going to be able to do more work (and hopefully finish)
> Hope this helps clear things up.
It doesn't sound like you and I are understanding how the yield_to() approach
would work. I hope my comments have helped get us on the same page.
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 23:56 ` H. Peter Anvin
@ 2014-03-06 0:02 ` Khalid Aziz
2014-03-06 0:13 ` H. Peter Anvin
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 0:02 UTC (permalink / raw)
To: H. Peter Anvin, David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 04:56 PM, H. Peter Anvin wrote:
> On 03/05/2014 03:48 PM, Khalid Aziz wrote:
>>
>> Cost is writing to a memory location since thread is using mmap, not
>> insignificant but hardly expensive. Thread does not need to know how
>> much time it has left in current timeslice. It always sets the flag to
>> request pre-emption immunity before entering the critical section and
>> clears the flag when it exits its critical section. If the thread comes
>> up for pre-emption while the flag is set, it gets immunity. If it does
>> not, flag will be cleared at the end of critical section any way.
>>
>
> A little more than that. The scheduler needs to set *another* flag
> telling the process to yield upon leaving the critical section; if the
> process doesn't, the scheduler needs to keep enough accounting to know
> to penalize the process, or this method will not be usable for
> unprivileged processes.
Yes, you had made that suggestion earlier and I like it. It will be in
v2 patch. I am thinking of making the penalty be denial of next
preemption immunity request if a process fails to yield when it should
have. Sounds good?
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 0:02 ` Khalid Aziz
@ 2014-03-06 0:13 ` H. Peter Anvin
0 siblings, 0 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-06 0:13 UTC (permalink / raw)
To: Khalid Aziz, David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 04:02 PM, Khalid Aziz wrote:
>
> Yes, you had made that suggestion earlier and I like it. It will be in
> v2 patch. I am thinking of making the penalty be denial of next
> preemption immunity request if a process fails to yield when it should
> have. Sounds good?
>
That is the minimum penalty possible. Hopefully that is sufficient.
-hpa
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 23:59 ` David Lang
@ 2014-03-06 0:17 ` Khalid Aziz
2014-03-06 0:36 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 0:17 UTC (permalink / raw)
To: David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 04:59 PM, David Lang wrote:
> what's the cost to setup mmap of this file in /proc. this is sounding
> like a lot of work.
That is a one time cost paid when a thread initializes itself.
>
>>> is this gain from not giving up the CPU at all? or is it from avoiding
>>> all the delays due to the contending thread trying in turn? the
>>> yield_to() approach avoids all those other threads trying in turn so it
>>> should get fairly close to the same benefits.
>>>
>>
>> The gain is from avoiding contention by giving locking thread a chance
>> to complete its critical section which is expected to be very short
>> (certainly shorter than timeslice). Pre-emption immunity gives it one
>> and only one additional timeslice.
>
> but the yield_to() does almost the same thing, there is a small bump,
> but you don't have to wait for thread B to spin, thread C..ZZZ etc to
> spin before thread A can finish it's work. As soon as the second thread
> hits the critical section, thread A is going to be able to do more work
> (and hopefully finish)
>
>> Hope this helps clear things up.
>
> It doesn't sound like you and I are understanding how the yield_to()
> approach would work. I hope my comments have helped get us on the same
> page.
>
I apologize if I am being dense. My understanding of yield_to() is what
Oleg had said in his reply earlier, so I will quote the example he gave:
my_lock()
{
if (!TRY_LOCK()) {
yield_to(owner);
LOCK();
}
owner = gettid();
}
If thread A had already lost the processor by the time thread B executes
above code, wouldn't we have paid the price of two context switches for
thread A?
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 0:17 ` Khalid Aziz
@ 2014-03-06 0:36 ` David Lang
2014-03-06 1:22 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-06 0:36 UTC (permalink / raw)
To: Khalid Aziz
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 04:59 PM, David Lang wrote:
>> what's the cost to setup mmap of this file in /proc. this is sounding
>> like a lot of work.
>
> That is a one time cost paid when a thread initializes itself.
>
>>
>>>> is this gain from not giving up the CPU at all? or is it from avoiding
>>>> all the delays due to the contending thread trying in turn? the
>>>> yield_to() approach avoids all those other threads trying in turn so it
>>>> should get fairly close to the same benefits.
>>>>
>>>
>>> The gain is from avoiding contention by giving locking thread a chance
>>> to complete its critical section which is expected to be very short
>>> (certainly shorter than timeslice). Pre-emption immunity gives it one
>>> and only one additional timeslice.
>>
>> but the yield_to() does almost the same thing, there is a small bump,
>> but you don't have to wait for thread B to spin, thread C..ZZZ etc to
>> spin before thread A can finish it's work. As soon as the second thread
>> hits the critical section, thread A is going to be able to do more work
>> (and hopefully finish)
>>
>>> Hope this helps clear things up.
>>
>> It doesn't sound like you and I are understanding how the yield_to()
>> approach would work. I hope my comments have helped get us on the same
>> page.
>>
>
> I apologize if I am being dense. My understanding of yield_to() is what Oleg
> had said in his reply earlier, so I will quote the example he gave:
>
> my_lock()
> {
> if (!TRY_LOCK()) {
> yield_to(owner);
> LOCK();
> }
>
> owner = gettid();
> }
>
> If thread A had already lost the processor by the time thread B executes
> above code, wouldn't we have paid the price of two context switches for
> thread A?
Yes, you pay for two context switches, but you don't pay for threads B..ZZZ all
running (and potentially spinning) trying to aquire the lock before thread A is
able to complete it's work.
As soon as a second thread hits the contention, thread A gets time to finish.
It's not as 'good' [1] as thread A just working longer, but it's FAR better than
thread A sleeping while every other thread runs and potentially tries to get the
lock
[1] it wastes the context switches, but it avoids the overhead of figuring out
if the thread needs to extend it's time, and if it's time was actually extended,
and what penalty it should suffer the next time it runs....
I expect that in practice, it will be very close to the 'theoretical ideal' of
"I'm doing something important, don't interrupt me", without all the potential
abuse of such a flag.
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 0:36 ` David Lang
@ 2014-03-06 1:22 ` Khalid Aziz
2014-03-06 14:23 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 1:22 UTC (permalink / raw)
To: David Lang
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On 03/05/2014 05:36 PM, David Lang wrote:
> Yes, you pay for two context switches, but you don't pay for threads
> B..ZZZ all running (and potentially spinning) trying to aquire the lock
> before thread A is able to complete it's work.
>
Ah, great. We are converging now.
> As soon as a second thread hits the contention, thread A gets time to
> finish.
Only as long as thread A could be scheduled immediately which may or may
not be the case depending upon what else is running on the core thread A
last ran on and if thread A needs to be migrated to another core.
>
> It's not as 'good' [1] as thread A just working longer,
and that is the exact spot where I am trying to improve performance.
> but it's FAR
> better than thread A sleeping while every other thread runs and
> potentially tries to get the lock
Absolutely. I agree with that.
>
> [1] it wastes the context switches, but it avoids the overhead of
> figuring out if the thread needs to extend it's time, and if it's time
> was actually extended, and what penalty it should suffer the next time
> it runs....
All of it can be done by setting and checking couple of flags in
task_struct. That is not insignificant, but hardly expensive. Logic is
quite simple:
resched()
{
........
if (immmunity) {
if (!penalty) {
immunity = 0;
penalty = 1;
-- skip context switch --
}
else {
immunity = penalty = 0;
-- do the context switch --
}
}
.........
}
sched_yield()
{
......
penalty = 0;
......
}
This simple logic will also work to defeat the obnoxius threads that
keep setting immunity request flag repeatedly within the same critical
section to give themselves multiple extensions.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 19:58 ` Khalid Aziz
@ 2014-03-06 9:57 ` Peter Zijlstra
2014-03-06 16:08 ` Khalid Aziz
2014-03-06 11:14 ` Thomas Gleixner
1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 9:57 UTC (permalink / raw)
To: Khalid Aziz
Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, akpm, viro, oleg, linux-kernel
On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
> On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> >On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> >>Anything else?
> >
> >Proxy execution; its a form of PI that works for arbitrary scheduling
> >policies (thus also very much including fair).
> >
> >With that what you effectively end up with is the lock holder running
> >'boosted' by the runtime of its blocked chain until the entire chain
> >runs out of time, at which point preemption doesn't matter anyhow.
> >
>
> Hello Peter,
>
> I read through the concept of proxy execution and it is a very interesting
> concept. I come from many years of realtime and embeddded systems
> development and I can easily recall various problems in the past that can be
> solved or helped by this.
Yeah; there's a few nasty cases with PEP on SMP though (the reason we've
not already have it); the trivial implementation that works wonderfully
on UP ends up capable of running the same task on multiple CPUs -- which
is an obvious fail.
There's people working on this though; as the scheme also works well for
the recently added deadline scheduler.
> Looking at the current problem I am trying to
> solve with databases and JVM, I run into the same issue I described in my
> earlier email. Proxy execution is a post-contention solution. By the time
> proxy execution can do something for my case, I have already paid the price
> of contention and a context switch which is what I am trying to avoid. For a
> critical section that is very short compared to the size of execution
> thread, which is the case I am looking at, avoiding preemption in the middle
> of that short critical section helps much more than dealing with lock
> contention later on.
Like others have already stated; its likely still cheaper than the
pile-up you get now. It might not be optimally fast, but it sure takes
out the worst case you have now.
> The goal here is to avoid lock contention and
> associated cost. I do understand the cost of dealing with lock contention
> poorly and that can easily be much bigger cost, but I am looking into
> avoiding even getting there.
The thing is; unless userspace is a RT program or practises the same
discipline in such an extend as that it make no practical difference,
there's always going to be the case where you fail to cover the entire
critical section, at which point you're back to your pile-up fail.
So while the limited preemption guard helps the best cast, it doesn't
help the worst case at all.
So supposing we went with this now; you (or someone else) will come back
in a year's time and tell us that if we only just stretch this window a
little, their favourite workload will also benefit.
Where's the end of that?
And what about CONFIG_HZ; suppose you compile your kernel with HZ=100
and your 1 extra tick is sufficient. Then someone compiles their kernel
with HZ=1000 and it all comes apart.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 19:58 ` Khalid Aziz
2014-03-06 9:57 ` Peter Zijlstra
@ 2014-03-06 11:14 ` Thomas Gleixner
2014-03-06 16:32 ` Khalid Aziz
1 sibling, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 11:14 UTC (permalink / raw)
To: Khalid Aziz
Cc: Peter Zijlstra, Andi Kleen, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, Andrew Morton, Al Viro, Oleg Nesterov, LKML
On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 04:10 AM, Peter Zijlstra wrote:
> > On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > > Anything else?
> >
> > Proxy execution; its a form of PI that works for arbitrary scheduling
> > policies (thus also very much including fair).
> >
> > With that what you effectively end up with is the lock holder running
> > 'boosted' by the runtime of its blocked chain until the entire chain
> > runs out of time, at which point preemption doesn't matter anyhow.
> >
>
> Hello Peter,
>
> I read through the concept of proxy execution and it is a very interesting
> concept. I come from many years of realtime and embeddded systems development
> and I can easily recall various problems in the past that can be solved or
> helped by this. Looking at the current problem I am trying to solve with
> databases and JVM, I run into the same issue I described in my earlier email.
> Proxy execution is a post-contention solution. By the time proxy execution can
> do something for my case, I have already paid the price of contention and a
> context switch which is what I am trying to avoid. For a critical section that
> is very short compared to the size of execution thread, which is the case I am
> looking at, avoiding preemption in the middle of that short critical section
> helps much more than dealing with lock contention later on. The goal here is
> to avoid lock contention and associated cost. I do understand the cost of
> dealing with lock contention poorly and that can easily be much bigger cost,
> but I am looking into avoiding even getting there.
We understand that you want to avoid preemption in the first place and
not getting into the contention handling case.
But, what you're trying to do is essentially creating an ABI which we
have to support and maintain forever. And that definitely is worth a
few serious questions.
Lets ignore the mm related issues for now as those can be solved. That's
the least of my worries.
Right now you are using this for a single use case with a well defined
environment, where all related threads reside in the same scheduling
class (FAIR). But that's one of a gazillion of use cases of Linux.
If we allow you to special case your database workload then we have no
argument why we should not do the same thing for realtime workloads
where the SCHED_FAIR housekeeping thread can hold a lock shortly to
access some important data in the SCHED_FIFO realtime computation
thread. Of course the RT people want to avoid the lock contention as
much as you do, just for different reasons.
Add SCHED_EDF, cgroups and hierarchical scheduling to the picture and
hell breaks lose.
Why? Simply because you applied the everything is a nail therefor I
use a hammer approach. Understandable because in your case (data base
workload) almost everything is the same nail.
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-05 0:51 ` Andi Kleen
2014-03-05 11:10 ` Peter Zijlstra
2014-03-05 14:54 ` Oleg Nesterov
@ 2014-03-06 12:13 ` Kevin Easton
2014-03-06 13:59 ` Peter Zijlstra
2014-03-06 14:25 ` David Lang
2 siblings, 2 replies; 67+ messages in thread
From: Kevin Easton @ 2014-03-06 12:13 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, oleg,
linux-kernel
On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> Anything else?
If it was possible to make the time remaining in the current timeslice
available to userspace through the vdso, the thread could do something like:
if (sys_timeleft() < CRITICAL_SECTION_SIZE)
yield();
lock();
to avoid running out of timeslice in the middle of the critical section.
- Kevin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39 ` Khalid Aziz
@ 2014-03-06 13:24 ` Rasmus Villemoes
2014-03-06 13:34 ` Peter Zijlstra
1 sibling, 1 reply; 67+ messages in thread
From: Rasmus Villemoes @ 2014-03-06 13:24 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Khalid Aziz, tglx, Ingo Molnar, peterz, akpm, andi.kleen, rob,
viro, oleg, venki, linux-kernel
"H. Peter Anvin" <hpa@zytor.com> writes:
> I have several issues with this interface:
>
> 1. First, a process needs to know if it *should* have been preempted
> before it calls sched_yield(). So there needs to be a second flag set
> by the scheduler when granting amnesty.
>
> 2. A process which fails to call sched_yield() after being granted
> amnesty must be penalized.
>
> 3. I'm not keen on occupying a full page for this. I'm wondering if
> doing a pointer into user space, futex-style, might make more sense.
> The downside, of course, is what happens if the page being pointed to is
> swapped out.
Is it possible to implement non-sleeping versions of {get,put}_user()?
That is, use the same basic approach (let the MMU do the hard work) but use
different, and simpler, "fixup" code (return -ESOMETHING on a major
fault).
If so, I think an extra pointer (->amnesty) and an extra bit
(->amnesty_granted) in task_struct suffices:
If the scheduler runs and tsk->amnesty_granted is true, penalize the
current task (and possibly clear tsk->amnesty_granted).
Otherwise, the task is granted amnesty provided these conditions are
met:
tsk->amnesty is non-NULL
get_user_nosleep(j, tsk->amnesty) succeeds
j is now 1
put_user_nosleep(YOU_WERE_LUCKY, tsk->amnesty) succeeds
If so, set amnesty_granted and let the thread continue; otherwise reschedule.
The userspace side would be something like
void *thread_func(void*)
{
int Im_busy = 0;
sched_need_amnesty(&Im_busy); /* better name needed */
/* -EPERM if not allowed (new capability?) */
/* go critical */
Im_busy = 1;
LOCK();
do_stuff();
UNLOCK();
if (Im_busy != 1) {
/* better play nice with the others */
sched_yield();
}
Im_busy = 0;
/* If the thread doesn't need the amnesty feature anymore, it can just do
sched_need_amnesty(NULL); */
}
Rasmus
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 13:24 ` Rasmus Villemoes
@ 2014-03-06 13:34 ` Peter Zijlstra
2014-03-06 13:45 ` Rasmus Villemoes
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 13:34 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
rob, viro, oleg, venki, linux-kernel
On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> Is it possible to implement non-sleeping versions of {get,put}_user()?
__{get,put}_user()
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 13:34 ` Peter Zijlstra
@ 2014-03-06 13:45 ` Rasmus Villemoes
2014-03-06 14:02 ` Peter Zijlstra
2014-03-06 14:04 ` Thomas Gleixner
0 siblings, 2 replies; 67+ messages in thread
From: Rasmus Villemoes @ 2014-03-06 13:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
rob, viro, oleg, venki, linux-kernel
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
>> Is it possible to implement non-sleeping versions of {get,put}_user()?
>
> __{get,put}_user()
Huh?
arch/x86/include/asm/uaccess.h:
/**
* __get_user: - Get a simple variable from user space, with less checking.
* @x: Variable to store result.
* @ptr: Source address, in user space.
*
* Context: User context only. This function may sleep.
What am I missing?
Rasmus
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 12:13 ` Kevin Easton
@ 2014-03-06 13:59 ` Peter Zijlstra
2014-03-06 22:41 ` Andi Kleen
2014-03-06 14:25 ` David Lang
1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 13:59 UTC (permalink / raw)
To: Kevin Easton
Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, akpm, viro, oleg, linux-kernel
On Thu, Mar 06, 2014 at 11:13:33PM +1100, Kevin Easton wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > Anything else?
>
> If it was possible to make the time remaining in the current timeslice
> available to userspace through the vdso, the thread could do something like:
Assuming we can do per-cpu values in the VDSO; this would mean hitting
that cacheline on every context switch and wakeup. That's a complete
non-starter performance wise.
>
> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
> yield();
> lock();
>
> to avoid running out of timeslice in the middle of the critical section.
Can still happen, the effective slice of a single runnable task is
infinite, the moment another task gets woken this gets reduced to a finite
amount, we then keep reducing the slice until there are about 8 runnable
tasks (assuming you've not poked at any sysctls).
Also; depending on the state of the just woken task; it might land left
of you in the tree, making it immediately eligible to run, completely
obviating whatever number you just read.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 13:45 ` Rasmus Villemoes
@ 2014-03-06 14:02 ` Peter Zijlstra
2014-03-06 14:33 ` Thomas Gleixner
2014-03-06 14:04 ` Thomas Gleixner
1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-03-06 14:02 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: H. Peter Anvin, Khalid Aziz, tglx, Ingo Molnar, akpm, andi.kleen,
rob, viro, oleg, venki, linux-kernel
On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> >
> > __{get,put}_user()
>
> Huh?
>
> arch/x86/include/asm/uaccess.h:
>
> /**
> * __get_user: - Get a simple variable from user space, with less checking.
> * @x: Variable to store result.
> * @ptr: Source address, in user space.
> *
> * Context: User context only. This function may sleep.
>
> What am I missing?
__get_user() -> __get_user_nocheck() -> __get_user_size() -> __get_user_asm()
And __get_user_asm() seems to generate the required .fixup section for
this to work in pagefault_disable() context.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 13:45 ` Rasmus Villemoes
2014-03-06 14:02 ` Peter Zijlstra
@ 2014-03-06 14:04 ` Thomas Gleixner
1 sibling, 0 replies; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 14:04 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Peter Zijlstra, H. Peter Anvin, Khalid Aziz, Ingo Molnar, akpm,
andi.kleen, rob, viro, oleg, venki, linux-kernel
On Thu, 6 Mar 2014, Rasmus Villemoes wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> >
> > __{get,put}_user()
>
> Huh?
>
> arch/x86/include/asm/uaccess.h:
>
> /**
> * __get_user: - Get a simple variable from user space, with less checking.
> * @x: Variable to store result.
> * @ptr: Source address, in user space.
> *
> * Context: User context only. This function may sleep.
>
> What am I missing?
All these variants can sleep, as they might trip a page fault.....
If you want to access user space from a context which cant sleep you
need to use the __copy_to/from_user_inatomic variants. If the access
fails then they return -EFAULT and you need to deal with that
yourself.
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 1:22 ` Khalid Aziz
@ 2014-03-06 14:23 ` David Lang
0 siblings, 0 replies; 67+ messages in thread
From: David Lang @ 2014-03-06 14:23 UTC (permalink / raw)
To: Khalid Aziz
Cc: Oleg Nesterov, Andi Kleen, Thomas Gleixner, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, linux-kernel
On Wed, 5 Mar 2014, Khalid Aziz wrote:
> On 03/05/2014 05:36 PM, David Lang wrote:
>> Yes, you pay for two context switches, but you don't pay for threads
>> B..ZZZ all running (and potentially spinning) trying to aquire the lock
>> before thread A is able to complete it's work.
>>
>
> Ah, great. We are converging now.
>
>> As soon as a second thread hits the contention, thread A gets time to
>> finish.
>
> Only as long as thread A could be scheduled immediately which may or may not
> be the case depending upon what else is running on the core thread A last ran
> on and if thread A needs to be migrated to another core.
>
>>
>> It's not as 'good' [1] as thread A just working longer,
>
> and that is the exact spot where I am trying to improve performance.
well, you have said that the "give me more time" flag results in 3-5% better
performance for databases under some workloads, how does this compare with the
results of yield_to()?
I think that the two approaches are going to be very close. I'd lay good odds
that the difference between the two is very hard to extract from the noise of
the variation of different runs (I won't say statistically insignificant, but I
will say that I expect it to take a lot of statistical analysis to pull it out
of the clutter)
David Lang
>> but it's FAR
>> better than thread A sleeping while every other thread runs and
>> potentially tries to get the lock
>
> Absolutely. I agree with that.
>
>>
>> [1] it wastes the context switches, but it avoids the overhead of
>> figuring out if the thread needs to extend it's time, and if it's time
>> was actually extended, and what penalty it should suffer the next time
>> it runs....
>
> All of it can be done by setting and checking couple of flags in task_struct.
> That is not insignificant, but hardly expensive. Logic is quite simple:
>
> resched()
> {
> ........
> if (immmunity) {
> if (!penalty) {
> immunity = 0;
> penalty = 1;
> -- skip context switch --
> }
> else {
> immunity = penalty = 0;
> -- do the context switch --
> }
> }
> .........
> }
>
> sched_yield()
> {
> ......
> penalty = 0;
> ......
> }
>
> This simple logic will also work to defeat the obnoxius threads that keep
> setting immunity request flag repeatedly within the same critical section to
> give themselves multiple extensions.
>
> Thanks,
> Khalid
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 12:13 ` Kevin Easton
2014-03-06 13:59 ` Peter Zijlstra
@ 2014-03-06 14:25 ` David Lang
2014-03-06 16:12 ` Khalid Aziz
1 sibling, 1 reply; 67+ messages in thread
From: David Lang @ 2014-03-06 14:25 UTC (permalink / raw)
To: Kevin Easton
Cc: Andi Kleen, Thomas Gleixner, Khalid Aziz, One Thousand Gnomes,
H. Peter Anvin, Ingo Molnar, peterz, akpm, viro, oleg,
linux-kernel
On Thu, 6 Mar 2014, Kevin Easton wrote:
> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>> Anything else?
>
> If it was possible to make the time remaining in the current timeslice
> available to userspace through the vdso, the thread could do something like:
>
> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
> yield();
> lock();
>
> to avoid running out of timeslice in the middle of the critical section.
but won't the system call result in context switches? According to Kevin, even a
context switch to another thread and back immediatly is bad enough to need to be
avoided, so replacing that with the context switch to the kernel and back isn't
a subtantial win.
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 14:02 ` Peter Zijlstra
@ 2014-03-06 14:33 ` Thomas Gleixner
2014-03-06 14:34 ` H. Peter Anvin
0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2014-03-06 14:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rasmus Villemoes, H. Peter Anvin, Khalid Aziz, Ingo Molnar, akpm,
andi.kleen, rob, viro, oleg, venki, linux-kernel
On Thu, 6 Mar 2014, Peter Zijlstra wrote:
> On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
> > Peter Zijlstra <peterz@infradead.org> writes:
> >
> > > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
> > >> Is it possible to implement non-sleeping versions of {get,put}_user()?
> > >
> > > __{get,put}_user()
> >
> > Huh?
> >
> > arch/x86/include/asm/uaccess.h:
> >
> > /**
> > * __get_user: - Get a simple variable from user space, with less checking.
> > * @x: Variable to store result.
> > * @ptr: Source address, in user space.
> > *
> > * Context: User context only. This function may sleep.
> >
> > What am I missing?
>
> __get_user() -> __get_user_nocheck() -> __get_user_size() -> __get_user_asm()
>
> And __get_user_asm() seems to generate the required .fixup section for
> this to work in pagefault_disable() context.
Well, it still might sleep if you're using it in page fault enabled
context. So the documentation of that function sucks.
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 14:33 ` Thomas Gleixner
@ 2014-03-06 14:34 ` H. Peter Anvin
0 siblings, 0 replies; 67+ messages in thread
From: H. Peter Anvin @ 2014-03-06 14:34 UTC (permalink / raw)
To: Thomas Gleixner, Peter Zijlstra
Cc: Rasmus Villemoes, Khalid Aziz, Ingo Molnar, akpm, andi.kleen,
rob, viro, oleg, venki, linux-kernel
The no checking is omitting access_ok(), no? Either way, disabling page faults have to be done explicitly.
On March 6, 2014 6:33:04 AM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
>On Thu, 6 Mar 2014, Peter Zijlstra wrote:
>
>> On Thu, Mar 06, 2014 at 02:45:00PM +0100, Rasmus Villemoes wrote:
>> > Peter Zijlstra <peterz@infradead.org> writes:
>> >
>> > > On Thu, Mar 06, 2014 at 02:24:43PM +0100, Rasmus Villemoes wrote:
>> > >> Is it possible to implement non-sleeping versions of
>{get,put}_user()?
>> > >
>> > > __{get,put}_user()
>> >
>> > Huh?
>> >
>> > arch/x86/include/asm/uaccess.h:
>> >
>> > /**
>> > * __get_user: - Get a simple variable from user space, with less
>checking.
>> > * @x: Variable to store result.
>> > * @ptr: Source address, in user space.
>> > *
>> > * Context: User context only. This function may sleep.
>> >
>> > What am I missing?
>>
>> __get_user() -> __get_user_nocheck() -> __get_user_size() ->
>__get_user_asm()
>>
>> And __get_user_asm() seems to generate the required .fixup section
>for
>> this to work in pagefault_disable() context.
>
>Well, it still might sleep if you're using it in page fault enabled
>context. So the documentation of that function sucks.
>
>Thanks,
>
> tglx
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 9:57 ` Peter Zijlstra
@ 2014-03-06 16:08 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, akpm, viro, oleg, linux-kernel
On 03/06/2014 02:57 AM, Peter Zijlstra wrote:
> On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
>> Looking at the current problem I am trying to
>> solve with databases and JVM, I run into the same issue I described in my
>> earlier email. Proxy execution is a post-contention solution. By the time
>> proxy execution can do something for my case, I have already paid the price
>> of contention and a context switch which is what I am trying to avoid. For a
>> critical section that is very short compared to the size of execution
>> thread, which is the case I am looking at, avoiding preemption in the middle
>> of that short critical section helps much more than dealing with lock
>> contention later on.
>
> Like others have already stated; its likely still cheaper than the
> pile-up you get now. It might not be optimally fast, but it sure takes
> out the worst case you have now.
>
>> The goal here is to avoid lock contention and
>> associated cost. I do understand the cost of dealing with lock contention
>> poorly and that can easily be much bigger cost, but I am looking into
>> avoiding even getting there.
>
> The thing is; unless userspace is a RT program or practises the same
> discipline in such an extend as that it make no practical difference,
> there's always going to be the case where you fail to cover the entire
> critical section, at which point you're back to your pile-up fail.
>
> So while the limited preemption guard helps the best cast, it doesn't
> help the worst case at all.
That is true. I am breaking this problem into two parts - (1) avoid pile
up, (2) if pile up happens, deal with it efficiently. Worst case
scenario you point out is the second part of the problem. Solutions for
that can be PTHREAD_PRIO_PROTECT protocol for the threads that use POSIX
threads or proxy execution. Once pile up has happened, cost of a system
call to boost thread priority becomes much smaller part of overall cost
of handling the pile up.
Part (1) of this problem is what my patch attempts to solve. Here the
cost of system call to boost priority or do anything else is too high.
The mechanism to avoid pile up has to be very light weight to be of any use.
>
> So supposing we went with this now; you (or someone else) will come back
> in a year's time and tell us that if we only just stretch this window a
> little, their favourite workload will also benefit.
>
> Where's the end of that?
>
> And what about CONFIG_HZ; suppose you compile your kernel with HZ=100
> and your 1 extra tick is sufficient. Then someone compiles their kernel
> with HZ=1000 and it all comes apart.
>
>
My goal here is to help the cases where critical section is short and
executes quickly as it should be for well designed critical sections in
threads that want to run using CFS. I see this as an incremental
improvement over current situation. With CFS, timeslice is adaptive and
depends upon the workload, so it is not directly tied to CONFIG_HZ. But
you are right, CONFIG_HZ does have a bearing on this. I see a critical
section that can easily go over a single timeslice and cause a pile up,
as a workload designed to create these problems. Such a workload needs
to use SCHED_FIFO or the deadline scheduler with properly designed yield
points and priorities, or live with the pile ups caused by using CFS.
Trying to help such cases with CFS is not beneficial and will cause CFS
to become more and more complex. What I am trying to do is help the
cases where a short critical section ends up being pre-empted simply
because the execution reached critical section only towards the end of
current timeslice and resulted in an unintended pile up. So give these
cases a tool to avoid pile ups but use of the tool comes with
restrictions (yield the processor as soon as you can if you got amnesty,
and pay a penalty if you don't). At this point, the two workloads I know
of that fit this group are databases and JVM both of which are in
significant use.
Makes sense?
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 14:25 ` David Lang
@ 2014-03-06 16:12 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:12 UTC (permalink / raw)
To: David Lang, Kevin Easton
Cc: Andi Kleen, Thomas Gleixner, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, peterz, akpm, viro, oleg, linux-kernel
On 03/06/2014 07:25 AM, David Lang wrote:
> On Thu, 6 Mar 2014, Kevin Easton wrote:
>
>> On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
>>> Anything else?
>>
>> If it was possible to make the time remaining in the current timeslice
>> available to userspace through the vdso, the thread could do something
>> like:
>>
>> if (sys_timeleft() < CRITICAL_SECTION_SIZE)
>> yield();
>> lock();
>>
>> to avoid running out of timeslice in the middle of the critical section.
>
> but won't the system call result in context switches? According to
> Kevin, even a context switch to another thread and back immediatly is
> bad enough to need to be avoided, so replacing that with the context
> switch to the kernel and back isn't a subtantial win.
>
> David Lang
Using vdso reduces the cost of system call significantly, but as Peter
pointed out a thread can not really rely upon the number it will get back.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 11:14 ` Thomas Gleixner
@ 2014-03-06 16:32 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-06 16:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Zijlstra, Andi Kleen, One Thousand Gnomes, H. Peter Anvin,
Ingo Molnar, Andrew Morton, Al Viro, Oleg Nesterov, LKML
On 03/06/2014 04:14 AM, Thomas Gleixner wrote:
> We understand that you want to avoid preemption in the first place and
> not getting into the contention handling case.
>
> But, what you're trying to do is essentially creating an ABI which we
> have to support and maintain forever. And that definitely is worth a
> few serious questions.
Fair enough. I agree a new ABI should not be created lightly.
>
> Lets ignore the mm related issues for now as those can be solved. That's
> the least of my worries.
>
> Right now you are using this for a single use case with a well defined
> environment, where all related threads reside in the same scheduling
> class (FAIR). But that's one of a gazillion of use cases of Linux.
>
Creating a new ABI for a single use case or a special case is something
I would argue against as well. I am with you on that. I am stating that
databases and JVM happen to be two real world examples of the scenario
where CFS can cause convoying problem inadvertently for a well designed
critical section that represents a small portion of overall execution
thread, simply because of where in the current timeslice the critical
section is hit. If there are other examples others have come across, I
would love to hear it. If we can indeed say this is a very special case
for an uncommon workload, I would completely agree with refusing to
create a new ABI.
> If we allow you to special case your database workload then we have no
> argument why we should not do the same thing for realtime workloads
> where the SCHED_FAIR housekeeping thread can hold a lock shortly to
> access some important data in the SCHED_FIFO realtime computation
> thread. Of course the RT people want to avoid the lock contention as
> much as you do, just for different reasons.
>
> Add SCHED_EDF, cgroups and hierarchical scheduling to the picture and
> hell breaks lose.
Realtime and deadline scheduler policies are supposed to be higher
priority than CFS. A thread running in CFS that can impact threads
running with realtime policies is a bad thing, agreed? What I am
proposing actually allows a thread running with CFS to get out of the
way of threads running with realtime policies quicker. In your specific
example, the SCHED_FAIR housekeeping thread gets a chance to get out of
SCHED_FIFO threads' way by giving its critical section better chance to
complete execution before causing a convoy problem and while its cache
is hot by using the exact same mechanism I am proposing. The logic is
not onerous. Thread asks for amnesty from one context switch if and only
if rescheduling point happens in the middle of its timeslice. If
rescheduling point does not occur during its critical section, the
thread takes that request back and life goes on as if nothing changed.
If rescheduling point happens in the middle of thread's critical
section, it gets the amnesty but it yields the processor as soon as it
is done with its critical section. Any thread that does not play nice
gets penalized next time it wants immunity (as hpa suggested).
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-06 13:59 ` Peter Zijlstra
@ 2014-03-06 22:41 ` Andi Kleen
0 siblings, 0 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-06 22:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kevin Easton, Andi Kleen, Thomas Gleixner, Khalid Aziz,
One Thousand Gnomes, H. Peter Anvin, Ingo Molnar, akpm, viro,
oleg, linux-kernel
On Thu, Mar 06, 2014 at 02:59:46PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 06, 2014 at 11:13:33PM +1100, Kevin Easton wrote:
> > On Tue, Mar 04, 2014 at 04:51:15PM -0800, Andi Kleen wrote:
> > > Anything else?
> >
> > If it was possible to make the time remaining in the current timeslice
> > available to userspace through the vdso, the thread could do something like:
>
> Assuming we can do per-cpu values in the VDSO; this would mean hitting
> that cacheline on every context switch and wakeup. That's a complete
> non-starter performance wise.
If you worry about fetching it you can always prefetch it early.
> > if (sys_timeleft() < CRITICAL_SECTION_SIZE)
> > yield();
> > lock();
> >
> > to avoid running out of timeslice in the middle of the critical section.
>
> Can still happen, the effective slice of a single runnable task is
> infinite, the moment another task gets woken this gets reduced to a finite
> amount, we then keep reducing the slice until there are about 8 runnable
> tasks (assuming you've not poked at any sysctls).
I guess it could be some predicted value, similar to how the menu
governour works.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v2] Pre-emption control for userspace
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
` (2 preceding siblings ...)
2014-03-04 21:12 ` H. Peter Anvin
@ 2014-03-25 17:17 ` Khalid Aziz
2014-03-25 17:44 ` Andrew Morton
` (3 more replies)
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
4 siblings, 4 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:17 UTC (permalink / raw)
To: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm
Cc: Khalid Aziz, linux-kernel, linux-doc
This patch adds a way for a thread to request additional timeslice from
the scheduler if it is about to be preempted, so it could complete any
critical task it is in the middle of. This functionality helps with
performance on databases and has been used for many years on other OSs
by the databases. This functionality helps in situation where a thread
acquires a lock before performing a critical operation on the database,
happens to get preempted before it completes its task and releases the
lock. This lock causes all other threads that also acquire the same
lock to perform their critical operation on the database to start
queueing up and causing large number of context switches. This queueing
problem can be avoided if the thread that acquires lock first could
request scheduler to grant it an additional timeslice once it enters its
critical section and hence allow it to complete its critical sectiona
without causing queueing problem. If critical section completes before
the thread is due for preemption, the thread can simply desassert its
request. A thread sends the scheduler this request by setting a flag in
a memory location it has shared with the kernel. Kernel uses bytes in
the same memory location to let the thread know when its request for
amnesty from preemption has been granted. Thread should yield the
processor at the end of its critical section if it was granted amnesty
to play nice with other threads. If thread fails to yield processor, it
gets penalized by having its next amnesty request turned down by
scheduler. Documentation file included in this patch contains further
details on how to use this functionality and conditions associated with
its use. This patch also adds a new field in scheduler statistics which
keeps track of how many times was a thread granted amnesty from
preemption. This feature and its usage are documented in
Documentation/scheduler/sched-preempt-delay.txt and this patch includes
a test for this feature under tools/testing/selftests/preempt-delay
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v2:
- Replaced mmap operation with a more memory efficient futex
like communication between userspace and kernel
- Added a flag to let userspace know if it was granted amnesty
- Added a penalty for tasks failing to yield CPU when they
are granted amnesty from pre-emption
v1:
- Initial RFC patch with mmap for communication between userspace
and kernel
Documentation/scheduler/sched-preempt-delay.txt | 121 ++++++++++
arch/x86/Kconfig | 12 +
fs/proc/base.c | 89 ++++++++
include/linux/sched.h | 15 ++
kernel/fork.c | 5 +
kernel/sched/core.c | 8 +
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 114 ++++++++-
tools/testing/selftests/preempt-delay/Makefile | 8 +
.../selftests/preempt-delay/preempt-delay.c | 254 +++++++++++++++++++++
10 files changed, 624 insertions(+), 3 deletions(-)
create mode 100644 Documentation/scheduler/sched-preempt-delay.txt
create mode 100644 tools/testing/selftests/preempt-delay/Makefile
create mode 100644 tools/testing/selftests/preempt-delay/preempt-delay.c
diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..38b4edc
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,121 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, scheduler ends up scheduling other
+threads which immediately try to grab the lock to enter the critical
+section. This only results in lots of context changes are tasks wake up
+and go to sleep immediately again. If on the other hand, the original
+task were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is enabled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. It first gives the kernel
+address for this memory location by writing its address to
+/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
+interpreted as a sequence of 4 bytes:
+
+ byte[0] = flag to request preemption delay
+ byte[1] = flag from kernel indicating preemption delay was granted
+ byte[2] = reserved for future use
+ byte[3] = reserved for future use
+
+Task requests a preemption delay by writing a non-zero value to the
+first byte. Scheduler checks this value before preempting the task.
+Scheduler can choose to grant one and only an additional time slice to
+the task for each delay request but this delay is not guaranteed.
+If scheduler does grant an additional timeslice, it will set the flag
+in second byte. Upon completion of the section of code where the task
+wants preemption delay, task should check the second byte. If the flag
+in second byte is set, it should clear this flag and call sched_yield()
+so as to not hog the processor. If a thread was granted additional
+timeslice and it fails to call sched_yield(), scheduler will penalize
+it by denying its next request for additional timeslice. Following sample
+code illustrates the use:
+
+int main()
+{
+ int fd, fsz;
+ unsigned char buf[256];
+ unsigned char preempt_delay[4];
+
+ sprintf(buf, “/proc/%lu/task/%lu/sched_preempt_delay”, getpid(),
+ syscall(SYS_gettid));
+ fd = open(buf, O_RDWR);
+
+ preempt_delay[0] = preempt_delay[1] = 0;
+
+ /* Tell kernel where the flag lives */
+ *(unsigned int **)buf = &preempt_delay;
+ write(fd, buf, sizeof(unsigned int *));
+
+ while (/* some condition is true */) {
+ /* do some work and get ready to enter critical section */
+ preempt_delay[0] = 1;
+ /*
+ * Obtain lock for critical section
+ */
+ /*
+ * critical section
+ */
+ /*
+ * Release lock for critical section
+ */
+ preempt_delay[0] = 0;
+ /* Give the CPU up if required */
+ if (preempt_delay[1]) {
+ preempt_delay[1] = 0;
+ sched_yield();
+ }
+ /* do some more work */
+ }
+ /*
+ * Tell kernel we are done asking for preemption delay
+ */
+ *(unsigned int **)buf = 0;
+ write(fd, buf, sizeof(unsigned int *));
+ close(fd);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+Preemption delay features adds a new field to scheduler statictics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. /proc/<tgid>/task/<tid>/sched_preempt_delay can be written to only
+ by the thread that corresponds to this file.
+
+2. /proc/<tgid>/task/<tid>/sched_preempt_delay can be written with valid
+ memory address once. To write a new memory address, the previous
+ memory address must be cleared first by writing NULL. Each new
+ memory address requires validation in the kernel and update of
+ pointers. Changing this address too many times creates too much
+ overhead.
+
+2. Reading /proc/<tgid>/task/<tid>/sched_preempt_delay returns the
+ current memory location address thread is using to communicate with
+ the kernel.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..2d54816 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -849,6 +849,18 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
+config SCHED_PREEMPT_DELAY
+ def_bool n
+ prompt "Scheduler preemption delay support"
+ depends on PROC_FS && PREEMPT_NOTIFIERS
+ ---help---
+ Say Y here if you want to be able to delay scheduler preemption
+ when possible by setting a flag in a memory location after
+ sharing the address of this location by writing to
+ /proc/<tgid>/task/<tid>/sched_preempt_delay. See
+ Documentation/scheduler/sched-preempt-delay.txt for details.
+ If in doubt, say "N".
+
source "kernel/Kconfig.preempt"
config X86_UP_APIC
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b976062..f6ab240 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1304,6 +1304,92 @@ static const struct file_operations proc_pid_sched_operations = {
#endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static int
+tid_preempt_delay_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *task = get_proc_task(inode);
+ unsigned char *delay_req;
+
+ if (!task)
+ return -ENOENT;
+
+ delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
+ seq_printf(m, "0x%-p\n", delay_req);
+
+ put_task_struct(task);
+ return 0;
+}
+
+static ssize_t
+tid_preempt_delay_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+ u32 __user *delay_req;
+ int retval;
+
+ if (!task) {
+ retval = -ENOENT;
+ goto out;
+ }
+
+ /*
+ * A thread can write only to its corresponding preempt_delay
+ * proc file
+ */
+ if (current != task) {
+ retval = -EPERM;
+ goto out;
+ }
+
+ delay_req = *(u32 __user **)buf;
+
+ /*
+ * Do not allow write if pointer is currently set
+ */
+ if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
+ retval = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Validate the pointer.
+ */
+ if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
+ retval = -EFAULT;
+ goto out;
+ }
+
+ task->sched_preempt_delay.delay_req = delay_req;
+
+ /* zero out flags */
+ put_user(0, delay_req);
+
+ retval = count;
+
+out:
+ put_task_struct(task);
+ return retval;
+}
+
+static int
+tid_preempt_delay_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, tid_preempt_delay_show, inode);
+}
+
+static const struct file_operations proc_tid_preempt_delay_ops = {
+ .open = tid_preempt_delay_open,
+ .read = seq_read,
+ .write = tid_preempt_delay_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
/*
* Print out autogroup related information:
@@ -2999,6 +3085,9 @@ static const struct pid_entry tid_base_stuff[] = {
REG("gid_map", S_IRUGO|S_IWUSR, proc_gid_map_operations),
REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
#endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+ REG("sched_preempt_delay", S_IRUGO|S_IWUSR, proc_tid_preempt_delay_ops),
+#endif
};
static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..77aba5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_statistics {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+ u64 nr_preempt_delayed;
};
#endif
@@ -1250,6 +1251,13 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+ struct preempt_delay {
+ u32 __user *delay_req; /* delay request flag pointer */
+ unsigned char delay_granted:1; /* currently in delay */
+ unsigned char yield_penalty:1; /* failure to yield penalty */
+ } sched_preempt_delay;
+#endif
pid_t pid;
pid_t tgid;
@@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
static inline bool sched_can_stop_tick(void) { return false; }
#endif
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+ struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+ unsigned char *val);
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern void sched_autogroup_create_attach(struct task_struct *p);
extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index a17621c..8847176 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1617,6 +1617,11 @@ long do_fork(unsigned long clone_flags,
init_completion(&vfork);
get_task_struct(p);
}
+#if CONFIG_SCHED_PREEMPT_DELAY
+ p->sched_preempt_delay.delay_req = NULL;
+ p->sched_preempt_delay.delay_granted = 0;
+ p->sched_preempt_delay.yield_penalty = 0;
+#endif
wake_up_new_task(p);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635..ec16b4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
{
struct rq *rq = this_rq_lock();
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+ /*
+ * Clear the penalty flag for current task to reward it for
+ * palying by the rules
+ */
+ current->sched_preempt_delay.yield_penalty = 0;
+#endif
+
schedstat_inc(rq, yld_count);
current->sched_class->yield_task(rq);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..2abd02b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -602,6 +602,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.statistics.nr_wakeups_affine_attempts);
P(se.statistics.nr_wakeups_passive);
P(se.statistics.nr_wakeups_idle);
+ P(se.statistics.nr_preempt_delayed);
{
u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b4c4f3..142bed5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -444,6 +444,114 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_task(): Check if the task about to be preempted has
+ * requested an additional time slice. If it has, grant it additional
+ * timeslice once.
+ */
+static void
+delay_resched_task(struct task_struct *curr)
+{
+ struct sched_entity *se;
+ int cpu = task_cpu(curr);
+ u32 __user *delay_req;
+ unsigned int delay_req_flag;
+ unsigned char *delay_flag;
+
+ /*
+ * Check if task is using pre-emption delay feature. If address
+ * for preemption delay request flag is not set, this task is
+ * not using preemption delay feature, we can reschedule without
+ * any delay
+ */
+ delay_req = curr->sched_preempt_delay.delay_req;
+
+ if ((delay_req == NULL) || (cpu != smp_processor_id()))
+ goto resched_now;
+
+ /*
+ * Pre-emption delay will be granted only once. If this task
+ * has already been granted delay, rechedule now
+ */
+ if (curr->sched_preempt_delay.delay_granted) {
+ curr->sched_preempt_delay.delay_granted = 0;
+ goto resched_now;
+ }
+
+ /*
+ * Get the value of preemption delay request flag from userspace.
+ * Task had already passed us the address where the flag is stored
+ * in userspace earlier. This flag is just like the PROCESS_PRIVATE
+ * futex, leverage the futex code here to read the flag. If there
+ * is a page fault accessing this flag in userspace, that means
+ * userspace has not touched this flag recently and we can
+ * assume no preemption delay is needed.
+ *
+ * If task is not requesting additional timeslice, resched now
+ */
+ if (delay_req) {
+ int ret;
+
+ pagefault_disable();
+ ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
+ sizeof(u32));
+ pagefault_enable();
+ delay_flag = &delay_req_flag;
+ if (ret || !delay_flag[0])
+ goto resched_now;
+ } else {
+ goto resched_now;
+ }
+
+ /*
+ * Current thread has requested preemption delay and has not
+ * been granted an extension yet. If this thread failed to yield
+ * processor after being granted amnesty last time, penalize it
+ * by not granting this delay request, otherwise give it an extra
+ * timeslice.
+ */
+ if (curr->sched_preempt_delay.yield_penalty) {
+ curr->sched_preempt_delay.yield_penalty = 0;
+ goto resched_now;
+ }
+
+ se = &curr->se;
+ curr->sched_preempt_delay.delay_granted = 1;
+
+ /*
+ * Set the penalty flag for failing to yield the processor after
+ * being granted immunity. This flag will be cleared in
+ * sched_yield() if the thread indeed calls sched_yield
+ */
+ curr->sched_preempt_delay.yield_penalty = 1;
+
+ /*
+ * Let the thread know it got amnesty and it should call
+ * sched_yield() when it is done to avoid penalty next time
+ * it wants amnesty. We need to write to userspace location.
+ * Since we just read from this location, chances are extremley
+ * low we might page fault. If we do page fault, we will ignore
+ * it and accept the cost of failed write in form of unnecessary
+ * penalty for userspace task for not yielding processor.
+ * This is a highly unlikely scenario.
+ */
+ delay_flag[0] = 0;
+ delay_flag[1] = 1;
+ pagefault_disable();
+ __copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
+ pagefault_enable();
+
+ schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+ return;
+
+resched_now:
+ resched_task(curr);
+}
+#else
+#define delay_resched_task(curr) resched_task(curr)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
@@ -2679,7 +2787,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
- resched_task(rq_of(cfs_rq)->curr);
+ delay_resched_task(rq_of(cfs_rq)->curr);
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
@@ -2703,7 +2811,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;
if (delta > ideal_runtime)
- resched_task(rq_of(cfs_rq)->curr);
+ delay_resched_task(rq_of(cfs_rq)->curr);
}
static void
@@ -4477,7 +4585,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return;
preempt:
- resched_task(curr);
+ delay_resched_task(curr);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
diff --git a/tools/testing/selftests/preempt-delay/Makefile b/tools/testing/selftests/preempt-delay/Makefile
new file mode 100644
index 0000000..b2da185
--- /dev/null
+++ b/tools/testing/selftests/preempt-delay/Makefile
@@ -0,0 +1,8 @@
+all:
+ gcc -pthread preempt-delay.c -o preempt-delay -lrt
+
+run_tests: all
+ ./preempt-delay 300 400
+
+clean:
+ rm -f ./preempt-delay
diff --git a/tools/testing/selftests/preempt-delay/preempt-delay.c b/tools/testing/selftests/preempt-delay/preempt-delay.c
new file mode 100644
index 0000000..59daf8f
--- /dev/null
+++ b/tools/testing/selftests/preempt-delay/preempt-delay.c
@@ -0,0 +1,254 @@
+/*
+ * This test program checks for the presence of preemption delay feature
+ * in the kernel. If the feature is present, it exercises it by running
+ * a number of threads that ask for preemption delay and checks if they
+ * are granted these preemption delays. It then runs the threads again
+ * without requesting preemption delays and verifies preemption delays
+ * are not granted when not requested (negative test).
+ */
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <time.h>
+#include <string.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+#define NUMTHREADS 1000
+
+pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
+unsigned long iterations;
+unsigned long delays_granted = 0;
+unsigned long request_delay = 1;
+
+#define BUFSIZE 1024
+
+int
+feature_check()
+{
+ unsigned char buf[BUFSIZE];
+
+ sprintf(buf, "/proc/%d/task/%ld/sched_preempt_delay",
+ getpid(), syscall(SYS_gettid));
+ if (access(buf, F_OK))
+ return 1;
+ return 0;
+}
+
+void
+do_some_work(void *param)
+{
+ struct timespec timeout;
+ int i, j, tid, fd, fsz;
+ unsigned long sum;
+ unsigned char buf[BUFSIZE];
+ unsigned char delay[4];
+ int cnt = 0;
+
+ /*
+ * mmap the sched_preempt_delay file
+ */
+ sprintf(buf, "/proc/%d/task/%ld/sched_preempt_delay",
+ getpid(), syscall(SYS_gettid));
+ fd = open(buf, O_RDWR);
+ if (fd == -1) {
+ perror("Error opening sched_preemp_delay file");
+ return;
+ }
+
+ for (i = 0; i < 4; i++)
+ delay[i] = 0;
+
+ if (request_delay) {
+ *(unsigned int **)buf = (unsigned int *) &delay;
+ if (write(fd, buf, sizeof(unsigned int *)) < 0) {
+ perror("Error writing flag address");
+ close(fd);
+ return;
+ }
+ }
+
+ tid = *(int *) param;
+
+ for (i = 0; i < iterations; i++) {
+ /* start by locking the resource */
+ if (request_delay)
+ delay[0] = 1;
+ if (pthread_mutex_lock(&mylock)) {
+ perror("mutex_lock():");
+ delay[0] = 0;
+ return;
+ }
+
+ /* Do some busy work */
+ sum = 0;
+ for (j = 0; j < (iterations*(tid+1)); j++)
+ sum += sum;
+ for (j = 0; j < iterations/(tid+1); j++)
+ sum += i^2;
+
+ /* Now unlock the resource */
+ if (pthread_mutex_unlock(&mylock)) {
+ perror("mutex_unlock():");
+ delay[0] = 0;
+ return;
+ }
+ delay[0] = 0;
+
+ if (delay[1]) {
+ delay[1] = 0;
+ cnt++;
+ sched_yield();
+ }
+ }
+
+ if (request_delay) {
+ *(unsigned int **)buf = 0;
+ if (write(fd, buf, sizeof(unsigned int *)) < 0) {
+ perror("Error clearing flag address");
+ close(fd);
+ return;
+ }
+ }
+ close(fd);
+
+ /*
+ * Update global count of delays granted. Need to grab a lock
+ * since this is a global.
+ */
+ if (pthread_mutex_lock(&mylock)) {
+ perror("mutex_lock():");
+ delay[0] = 0;
+ return;
+ }
+ delays_granted += cnt;
+ if (pthread_mutex_unlock(&mylock)) {
+ perror("mutex_unlock():");
+ delay[0] = 0;
+ return;
+ }
+}
+
+void
+help(char *progname)
+{
+ fprintf(stderr, "Usage: %s <number of threads> ", progname);
+ fprintf(stderr, "<number of iterations>\n");
+ fprintf(stderr, " Notes: (1) Maximum number of threads is %d\n",
+ NUMTHREADS);
+ fprintf(stderr, " (2) Suggested number of iterations is ");
+ fprintf(stderr, "300-10000\n");
+ fprintf(stderr, " (3) Exit codes are: 1 = Failed with no ");
+ fprintf(stderr, "preemption delays granted\n");
+ fprintf(stderr, " 2 = Failed with ");
+ fprintf(stderr, "preemption delays granted when\n");
+ fprintf(stderr, " not requested\n");
+ fprintf(stderr, " 3 = Error in test ");
+ fprintf(stderr, "arguments\n");
+ fprintf(stderr, " 4 = Other errors\n");
+}
+
+int main(int argc, char **argv)
+{
+ pthread_t thread[NUMTHREADS];
+ int ret, i, tid[NUMTHREADS];
+ unsigned long nthreads;
+
+ /* check arguments */
+ if (argc < 3) {
+ help(argv[0]);
+ exit(3);
+ }
+
+ nthreads = atoi(argv[1]);
+ iterations = atoi(argv[2]);
+ if (nthreads > NUMTHREADS) {
+ fprintf(stderr, "ERROR: exceeded maximum number of threads\n");
+ exit(3);
+ }
+
+ /*
+ * Check for the presence of feature
+ */
+ if (feature_check()) {
+ printf("INFO: Pre-emption delay feature is not present in ");
+ printf("this kernel\n");
+ exit(0);
+ }
+
+ /*
+ * Create a bunch of threads that will compete for the
+ * same mutex. Run these threads first while requesting
+ * preemption delay.
+ */
+ for (i = 0; i < nthreads; i++) {
+ tid[i] = i;
+ ret = pthread_create(&thread[i], NULL, (void *)&do_some_work,
+ &tid[i]);
+ if (ret) {
+ perror("pthread_create(): ");
+ exit(4);
+ }
+ }
+
+ printf("Threads started. Waiting......\n");
+ /* Now wait for threads to get done */
+ for (i = 0; i < nthreads; i++) {
+ ret = pthread_join(thread[i], NULL);
+ if (ret) {
+ perror("pthread_join(): ");
+ exit(4);
+ }
+ }
+
+ /*
+ * We started out with requesting pre-emption delays, check if
+ * we got at least a few.
+ */
+ if (delays_granted == 0) {
+ fprintf(stderr, "FAIL: No delays granted at all.\n");
+ exit(1);
+ }
+
+ /*
+ * Run the threads again, this time not requesting preemption delays
+ */
+ request_delay = 0;
+ delays_granted = 0;
+ for (i = 0; i < nthreads; i++) {
+ tid[i] = i;
+ ret = pthread_create(&thread[i], NULL, (void *)&do_some_work,
+ &tid[i]);
+ if (ret) {
+ perror("pthread_create(): ");
+ exit(4);
+ }
+ }
+
+ printf("Threads started. Waiting......\n");
+ /* Now wait for threads to get done */
+ for (i = 0; i < nthreads; i++) {
+ ret = pthread_join(thread[i], NULL);
+ if (ret) {
+ perror("pthread_join(): ");
+ exit(4);
+ }
+ }
+
+ /*
+ * Check if preemption delays were granted even though we
+ * did not ask for them
+ */
+ if (delays_granted > 0) {
+ fprintf(stderr, "FAIL: delays granted when not requested.\n");
+ exit(2);
+ }
+}
--
1.8.3.2
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
@ 2014-03-25 17:44 ` Andrew Morton
2014-03-25 17:56 ` Khalid Aziz
2014-03-25 17:46 ` Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2014-03-25 17:44 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On Tue, 25 Mar 2014 11:17:50 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock. This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel. Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler. Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay
What a long paragraph ;)
The feature makes sense and sounds useful, but I'd like to see some
real world performance testing results so we can understand what the
benefit will be to our users?
>
> ...
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +static int
> +tid_preempt_delay_show(struct seq_file *m, void *v)
> +{
> + struct inode *inode = m->private;
> + struct task_struct *task = get_proc_task(inode);
> + unsigned char *delay_req;
> +
> + if (!task)
> + return -ENOENT;
> +
> + delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
> + seq_printf(m, "0x%-p\n", delay_req);
> +
> + put_task_struct(task);
> + return 0;
> +}
> +
> +static ssize_t
> +tid_preempt_delay_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *offset)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + u32 __user *delay_req;
> + int retval;
> +
> + if (!task) {
> + retval = -ENOENT;
> + goto out;
> + }
> +
> + /*
> + * A thread can write only to its corresponding preempt_delay
> + * proc file
> + */
> + if (current != task) {
> + retval = -EPERM;
> + goto out;
> + }
> +
> + delay_req = *(u32 __user **)buf;
> +
> + /*
> + * Do not allow write if pointer is currently set
> + */
> + if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
> + retval = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Validate the pointer.
> + */
> + if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
> + retval = -EFAULT;
> + goto out;
> + }
> +
> + task->sched_preempt_delay.delay_req = delay_req;
> +
> + /* zero out flags */
> + put_user(0, delay_req);
> +
> + retval = count;
> +
> +out:
> + put_task_struct(task);
> + return retval;
> +}
So the procfs file is written in binary format and is read back in
ascii format. Seems odd.
Perhaps this should all be done as a new syscall rather than some
procfs thing.
>
> ...
>
> @@ -1250,6 +1251,13 @@ struct task_struct {
> /* Revert to default priority/policy when forking */
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> + struct preempt_delay {
> + u32 __user *delay_req; /* delay request flag pointer */
> + unsigned char delay_granted:1; /* currently in delay */
> + unsigned char yield_penalty:1; /* failure to yield penalty */
> + } sched_preempt_delay;
The problem with bitfields is that a write to one bitfield can corrupt
a concurrent write to the other one. So it's your responsibility to
provide locking and/or to describe how this race is avoided. A comment
here in the definition would be a suitable way of addressing this.
> +#endif
>
> pid_t pid;
> pid_t tgid;
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
> {
> struct rq *rq = this_rq_lock();
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> + /*
> + * Clear the penalty flag for current task to reward it for
> + * palying by the rules
"playing"
> + */
> + current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
> schedstat_inc(rq, yld_count);
> current->sched_class->yield_task(rq);
>
> ...
>
> +static void
> +delay_resched_task(struct task_struct *curr)
> +{
> + struct sched_entity *se;
> + int cpu = task_cpu(curr);
> + u32 __user *delay_req;
> + unsigned int delay_req_flag;
> + unsigned char *delay_flag;
> +
> + /*
> + * Check if task is using pre-emption delay feature. If address
> + * for preemption delay request flag is not set, this task is
> + * not using preemption delay feature, we can reschedule without
> + * any delay
> + */
> + delay_req = curr->sched_preempt_delay.delay_req;
> +
> + if ((delay_req == NULL) || (cpu != smp_processor_id()))
Presumably we don't get "smp_processor_id() used in preemptible code"
warnings here, so called-with-preempt-disabled is a secret prerequisite
for delay_resched_task().
> + goto resched_now;
> +
> + /*
> + * Pre-emption delay will be granted only once. If this task
> + * has already been granted delay, rechedule now
> + */
> + if (curr->sched_preempt_delay.delay_granted) {
> + curr->sched_preempt_delay.delay_granted = 0;
> + goto resched_now;
> + }
> +
> + /*
> + * Get the value of preemption delay request flag from userspace.
> + * Task had already passed us the address where the flag is stored
> + * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> + * futex, leverage the futex code here to read the flag. If there
> + * is a page fault accessing this flag in userspace, that means
> + * userspace has not touched this flag recently and we can
> + * assume no preemption delay is needed.
> + *
> + * If task is not requesting additional timeslice, resched now
> + */
> + if (delay_req) {
> + int ret;
> +
> + pagefault_disable();
> + ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> + sizeof(u32));
> + pagefault_enable();
This all looks rather hacky and unneccesary. Can't we somehow use
plain old get_user() and avoid such fuss?
> + delay_flag = &delay_req_flag;
> + if (ret || !delay_flag[0])
> + goto resched_now;
> + } else {
> + goto resched_now;
> + }
> +
> + /*
> + * Current thread has requested preemption delay and has not
> + * been granted an extension yet. If this thread failed to yield
> + * processor after being granted amnesty last time, penalize it
> + * by not granting this delay request, otherwise give it an extra
> + * timeslice.
> + */
> + if (curr->sched_preempt_delay.yield_penalty) {
> + curr->sched_preempt_delay.yield_penalty = 0;
> + goto resched_now;
> + }
> +
> + se = &curr->se;
> + curr->sched_preempt_delay.delay_granted = 1;
> +
> + /*
> + * Set the penalty flag for failing to yield the processor after
> + * being granted immunity. This flag will be cleared in
> + * sched_yield() if the thread indeed calls sched_yield
> + */
> + curr->sched_preempt_delay.yield_penalty = 1;
> +
> + /*
> + * Let the thread know it got amnesty and it should call
> + * sched_yield() when it is done to avoid penalty next time
> + * it wants amnesty. We need to write to userspace location.
> + * Since we just read from this location, chances are extremley
> + * low we might page fault. If we do page fault, we will ignore
> + * it and accept the cost of failed write in form of unnecessary
> + * penalty for userspace task for not yielding processor.
> + * This is a highly unlikely scenario.
> + */
> + delay_flag[0] = 0;
> + delay_flag[1] = 1;
> + pagefault_disable();
> + __copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
> + pagefault_enable();
and put_user() here.
> + schedstat_inc(curr, se.statistics.nr_preempt_delayed);
> + return;
> +
> +resched_now:
> + resched_task(curr);
> +}
> +#else
> +#define delay_resched_task(curr) resched_task(curr)
This could have been implemented in C...
> +#endif /* CONFIG_SCHED_PREEMPT_DELAY */
> +
>
> ...
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44 ` Andrew Morton
@ 2014-03-25 17:46 ` Oleg Nesterov
2014-03-25 17:59 ` Khalid Aziz
2014-03-25 18:20 ` Andi Kleen
2014-03-25 18:59 ` Eric W. Biederman
3 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2014-03-25 17:46 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, gnomes,
riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On 03/25, Khalid Aziz wrote:
>
> fs/proc/base.c | 89 ++++++++
This code can be simplified, but the real question is why do we need it.
Just add PR_SET_PREEMPT_DELAY ?
Oleg.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:44 ` Andrew Morton
@ 2014-03-25 17:56 ` Khalid Aziz
2014-03-25 18:14 ` Andrew Morton
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:56 UTC (permalink / raw)
To: Andrew Morton
Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On 03/25/2014 11:44 AM, Andrew Morton wrote:
> So the procfs file is written in binary format and is read back in
> ascii format. Seems odd.
>
> Perhaps this should all be done as a new syscall rather than some
> procfs thing.
>
I didn't want to add yet another syscall which will then need to be
added to glibc, but I am open to doing it through a syscall if that is
the consensus.
>> + struct preempt_delay {
>> + u32 __user *delay_req; /* delay request flag pointer */
>> + unsigned char delay_granted:1; /* currently in delay */
>> + unsigned char yield_penalty:1; /* failure to yield penalty */
>> + } sched_preempt_delay;
>
> The problem with bitfields is that a write to one bitfield can corrupt
> a concurrent write to the other one. So it's your responsibility to
> provide locking and/or to describe how this race is avoided. A comment
> here in the definition would be a suitable way of addressing this.
>
I do not have a strong reason to use a bitfield, just trying to not use
any more bytes than I need to. If using a char is safer, I would rather
use safer code.
>> + if (delay_req) {
>> + int ret;
>> +
>> + pagefault_disable();
>> + ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
>> + sizeof(u32));
>> + pagefault_enable();
>
> This all looks rather hacky and unneccesary. Can't we somehow use
> plain old get_user() and avoid such fuss?
get_user() takes longer and can sleep if page fault occurs. I need this
code to be very fast for it to be beneficial and am willing to ignore
page faults since page fault would imply the task has not touched
pre-emption delay request field and hence we can resched safely.
>> +#else
>> +#define delay_resched_task(curr) resched_task(curr)
>
> This could have been implemented in C...
>
Sure, I can do that.
Thanks, Andrew!
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:46 ` Oleg Nesterov
@ 2014-03-25 17:59 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 17:59 UTC (permalink / raw)
To: Oleg Nesterov
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, gnomes,
riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On 03/25/2014 11:46 AM, Oleg Nesterov wrote:
> On 03/25, Khalid Aziz wrote:
>>
>> fs/proc/base.c | 89 ++++++++
>
> This code can be simplified, but the real question is why do we need it.
>
> Just add PR_SET_PREEMPT_DELAY ?
>
> Oleg.
>
I like this idea! Thanks.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:56 ` Khalid Aziz
@ 2014-03-25 18:14 ` Andrew Morton
0 siblings, 0 replies; 67+ messages in thread
From: Andrew Morton @ 2014-03-25 18:14 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, andi.kleen, rob, viro, oleg, gnomes,
riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On Tue, 25 Mar 2014 11:56:31 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> On 03/25/2014 11:44 AM, Andrew Morton wrote:
> > So the procfs file is written in binary format and is read back in
> > ascii format. Seems odd.
> >
> > Perhaps this should all be done as a new syscall rather than some
> > procfs thing.
> >
>
> I didn't want to add yet another syscall which will then need to be
> added to glibc, but I am open to doing it through a syscall if that is
> the consensus.
>
> >> + struct preempt_delay {
> >> + u32 __user *delay_req; /* delay request flag pointer */
> >> + unsigned char delay_granted:1; /* currently in delay */
> >> + unsigned char yield_penalty:1; /* failure to yield penalty */
> >> + } sched_preempt_delay;
> >
> > The problem with bitfields is that a write to one bitfield can corrupt
> > a concurrent write to the other one. So it's your responsibility to
> > provide locking and/or to describe how this race is avoided. A comment
> > here in the definition would be a suitable way of addressing this.
> >
>
> I do not have a strong reason to use a bitfield, just trying to not use
> any more bytes than I need to. If using a char is safer, I would rather
> use safer code.
My point is that the locking rules should be documented, via a code comment.
Presumably that rule is "only ever modified by this task".
> >> + if (delay_req) {
> >> + int ret;
> >> +
> >> + pagefault_disable();
> >> + ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> >> + sizeof(u32));
> >> + pagefault_enable();
> >
> > This all looks rather hacky and unneccesary. Can't we somehow use
> > plain old get_user() and avoid such fuss?
>
> get_user() takes longer and can sleep if page fault occurs. I need this
> code to be very fast for it to be beneficial and am willing to ignore
> page faults since page fault would imply the task has not touched
> pre-emption delay request field and hence we can resched safely.
That's what I meant by "hacky" :)
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44 ` Andrew Morton
2014-03-25 17:46 ` Oleg Nesterov
@ 2014-03-25 18:20 ` Andi Kleen
2014-03-25 18:47 ` Khalid Aziz
2014-03-25 18:59 ` Eric W. Biederman
3 siblings, 1 reply; 67+ messages in thread
From: Andi Kleen @ 2014-03-25 18:20 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
Khalid Aziz <khalid.aziz@oracle.com> writes:
First it would be nice to have some standard reference lock library that
uses this. What would it take to support this in glibc?
> +==================================
> +Using the preemption delay feature
> +==================================
> +
> +This feature is enabled in the kernel by setting
> +CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
> +enabled, the userspace process communicates with the kernel using a
> +4-byte memory location in its address space. It first gives the kernel
> +address for this memory location by writing its address to
> +/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
> +interpreted as a sequence of 4 bytes:
> +
> + byte[0] = flag to request preemption delay
> + byte[1] = flag from kernel indicating preemption delay was granted
> + byte[2] = reserved for future use
> + byte[3] = reserved for future use
Should reserve more bytes (64, 128?) and rename the proc flag
to a more generic name. I could well assume other things
using such a mechanism in the future. Also please add a flag
word with feature bits (similar to the perf mmap page)
How about alignment? x86 will not care, but other architectures
may.
> #endif
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> + REG("sched_preempt_delay", S_IRUGO|S_IWUSR,
> proc_tid_preempt_delay_ops),
This shouldn't be readable by group/other, as it exposes the address space,
so could help exploits.
> @@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
> static inline bool sched_can_stop_tick(void) { return false; }
> #endif
>
> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
> +extern void sched_preempt_delay_show(struct seq_file *m,
> + struct task_struct *task);
> +extern void sched_preempt_delay_set(struct task_struct *task,
> + unsigned char *val);
> +#endif
Prototypes don't need to be ifdefed.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 18:20 ` Andi Kleen
@ 2014-03-25 18:47 ` Khalid Aziz
2014-03-25 19:47 ` Andi Kleen
0 siblings, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 18:47 UTC (permalink / raw)
To: Andi Kleen
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn, ebiederm,
linux-kernel, linux-doc
On 03/25/2014 12:20 PM, Andi Kleen wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
>
> First it would be nice to have some standard reference lock library that
> uses this. What would it take to support this in glibc?
I am not sure if it would be practical and useful to integrate this into
any of the standard locking interfaces, but I have not looked into it
much either. My initial intent is to let individual apps decide if they
could benefit from this interface and code it in if so since the
interface is meant to be very simple. Do you see any of the standard
locking interfaces where it would make sense to integrate this feature
in, or are you thinking of creating a new interface?
>
>> +==================================
>> +Using the preemption delay feature
>> +==================================
>> +
>> +This feature is enabled in the kernel by setting
>> +CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. Once this feature is
>> +enabled, the userspace process communicates with the kernel using a
>> +4-byte memory location in its address space. It first gives the kernel
>> +address for this memory location by writing its address to
>> +/proc/<tgid>/task/<tid>/sched_preempt_delay. This memory location is
>> +interpreted as a sequence of 4 bytes:
>> +
>> + byte[0] = flag to request preemption delay
>> + byte[1] = flag from kernel indicating preemption delay was granted
>> + byte[2] = reserved for future use
>> + byte[3] = reserved for future use
>
> Should reserve more bytes (64, 128?) and rename the proc flag
> to a more generic name. I could well assume other things
> using such a mechanism in the future. Also please add a flag
> word with feature bits (similar to the perf mmap page)
I am reluctant to make it too big since reading larger quantities from
userspace will take longer and start to impact performance. Keeping
shared data limited to 32-bits allows us to move it between userspace
and kernel with one instruction.
>
> How about alignment? x86 will not care, but other architectures
> may.
>
You are right. These 4-bytes need to be aligned to a 4-byte boundary. I
will look into adding an alignment check at the time this address is
passed to kernel.
>> #endif
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> + REG("sched_preempt_delay", S_IRUGO|S_IWUSR,
>> proc_tid_preempt_delay_ops),
>
> This shouldn't be readable by group/other, as it exposes the address space,
> so could help exploits.
I like Oleg's suggestion of using prctl() instead of procfs to pass the
userspace address to kernel. Above problem will disappear when I switch
to prctl() in v3 of this patch.
>
>> @@ -2061,6 +2069,13 @@ extern u64 scheduler_tick_max_deferment(void);
>> static inline bool sched_can_stop_tick(void) { return false; }
>> #endif
>>
>> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
>> +extern void sched_preempt_delay_show(struct seq_file *m,
>> + struct task_struct *task);
>> +extern void sched_preempt_delay_set(struct task_struct *task,
>> + unsigned char *val);
>> +#endif
>
> Prototypes don't need to be ifdefed.
>
> -Andi
>
Ah, you are right. Thanks, Andi!
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
` (2 preceding siblings ...)
2014-03-25 18:20 ` Andi Kleen
@ 2014-03-25 18:59 ` Eric W. Biederman
2014-03-25 19:15 ` Khalid Aziz
2014-03-26 6:03 ` Mike Galbraith
3 siblings, 2 replies; 67+ messages in thread
From: Eric W. Biederman @ 2014-03-25 18:59 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
linux-kernel, linux-doc
Khalid Aziz <khalid.aziz@oracle.com> writes:
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock. This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel. Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler. Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay
Let me see if I understand the problem. Your simulated application has
a ridiculous number of threads (1000) all contending for a single lock
with fairly long lock hold times between 600 and 20000 clocks assuming
no cache line misses. So 1000 threads contending for about 10usec or
1/100 of a tick when HZ=1000. Giving you something like 1 chance in
100 of being preempted while holding the lock. With 1000 threads
those sound like pretty bad odds.
Either your test program is a serious exageration of what your userspace
is doing or this looks like an application design problem.
I am sorry no number of kernel patches can fix a stupid userspace
application, and what is worse it looks like this approach will make
the situation worse for applications that aren't stupid. Because they
will now suffer from much less predictability in how long they have to
wait for the cpu.
Maybe if this was limited to a cooperating set of userspace
tasks/threads this might not be too bad. As this exists I have users
who would hunt me down with malicious intent if this code ever showed up
on our servers, because it would make life for every other application
on the server worse.
The only two sane versions of this I can see are (a) having the
scheduler write the predicted next preemption time into the vdso page so
your thread can yield preemptively before taking the lock if it doesn't
look like it has enough time, or (b) limiting this to just a small
cooperating set of threads in a single cgroup.
As you have not limited the effects of this patch and as this will make
latencies worse for every other program on a system I think this is a
horrible approach. This really is not something you can do unless all
of the threads that could be affected are in the same code base, which
is definitely not the case here.
So for the general horrible idea.
Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cooperative multitasking sucked in Windows 3.1 and it would be much
worse now. Please stop the crazy. Linux is challenging enough to
comprehend as it is, and I can't possibly see this patch makes anything
more predictable.
Eric
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 18:59 ` Eric W. Biederman
@ 2014-03-25 19:15 ` Khalid Aziz
2014-03-25 20:31 ` Eric W. Biederman
2014-03-26 6:03 ` Mike Galbraith
1 sibling, 1 reply; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 19:15 UTC (permalink / raw)
To: Eric W. Biederman
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
linux-kernel, linux-doc
On 03/25/2014 12:59 PM, ebiederm@xmission.com wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
>
>> This patch adds a way for a thread to request additional timeslice from
>> the scheduler if it is about to be preempted, so it could complete any
>> critical task it is in the middle of. .......
>
>
> Let me see if I understand the problem. Your simulated application has
> a ridiculous number of threads (1000) all contending for a single lock
> with fairly long lock hold times between 600 and 20000 clocks assuming
> no cache line misses. So 1000 threads contending for about 10usec or
> 1/100 of a tick when HZ=1000. Giving you something like 1 chance in
> 100 of being preempted while holding the lock. With 1000 threads
> those sound like pretty bad odds.
This problem does not happen because threads are holding the lock for
too long, rather it happens when a thread does a large number of things
in its loop and one small part of it requires it to hold a lock. So it
holds the lock for a very short time but what can happen is thread is
executing in non-critical section of its loop, it finally gets to the
critical section just as its timeslice is about to end, it grabs the
lock and is pre-empted right away. Now we start building a convoy of
threads that want the same lock. This problem can be avoided if the
locking thread could be given additional time to complete its critical
section, release the lock and yield the processor if it indeed was
granted amnesty by the scheduler.
>
> Either your test program is a serious exageration of what your userspace
> is doing or this looks like an application design problem.
If you are referring to the test I included in the patch, that test is
designed to exaggerate this problem just so it could test the
functionality. It is nothing more than a test for this functionality so
we can test for any regressions in future.
>
> I am sorry no number of kernel patches can fix a stupid userspace
> application, and what is worse it looks like this approach will make
> the situation worse for applications that aren't stupid. Because they
> will now suffer from much less predictability in how long they have to
> wait for the cpu.
New code in scheduler kicks in only for the threads that have explicitly
asked kernel to use this functionality by sending it the address of a
shared memory location. There is a very quick bail out at the top of new
resched routine to check for this and not affect every thread.
>
> Maybe if this was limited to a cooperating set of userspace
> tasks/threads this might not be too bad. As this exists I have users
> who would hunt me down with malicious intent if this code ever showed up
> on our servers, because it would make life for every other application
> on the server worse.
>
Yes, it is indeed limited to a cooperating set of userspace
tasks/threads. Tasks/threads will explicitly choose to use this feature.
It is a no-op for every one else.
> The only two sane versions of this I can see are (a) having the
> scheduler write the predicted next preemption time into the vdso page so
> your thread can yield preemptively before taking the lock if it doesn't
> look like it has enough time,
This was discussed as an option when I posted first version of this
patch and dismissed due to lack of reliability of such a data.
> or (b) limiting this to just a small
> cooperating set of threads in a single cgroup.
and that is almost what this patch does. It is not limited to a cgroup,
rather to the tasks/threads that ask to use this feature.
>
> As you have not limited the effects of this patch and as this will make
> latencies worse for every other program on a system I think this is a
> horrible approach. This really is not something you can do unless all
> of the threads that could be affected are in the same code base, which
> is definitely not the case here.
>
> So for the general horrible idea.
> Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Cooperative multitasking sucked in Windows 3.1 and it would be much
> worse now. Please stop the crazy. Linux is challenging enough to
> comprehend as it is, and I can't possibly see this patch makes anything
> more predictable.
>
> Eric
>
I completely agree with you on cooperative multitasking. I am definitely
not trying to do anything like that.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 18:47 ` Khalid Aziz
@ 2014-03-25 19:47 ` Andi Kleen
0 siblings, 0 replies; 67+ messages in thread
From: Andi Kleen @ 2014-03-25 19:47 UTC (permalink / raw)
To: Khalid Aziz
Cc: Andi Kleen, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
viro, oleg, gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
ebiederm, linux-kernel, linux-doc
On Tue, Mar 25, 2014 at 12:47:52PM -0600, Khalid Aziz wrote:
> I am not sure if it would be practical and useful to integrate this
> into any of the standard locking interfaces, but I have not looked
> into it much either. My initial intent is to let individual apps
> decide if they could benefit from this interface and code it in if
> so since the interface is meant to be very simple. Do you see any of
> the standard locking interfaces where it would make sense to
> integrate this feature in, or are you thinking of creating a new
> interface?
It would probably make sense to use by default with glibc adaptive mutexes.
> I am reluctant to make it too big since reading larger quantities
> from userspace will take longer and start to impact performance.
> Keeping shared data limited to 32-bits allows us to move it between
> userspace and kernel with one instruction.
You don't need to read/write more. Just reserve more so that it
can be sensibly extended later.
The feature bits would only need to be written once when it is set up.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 19:15 ` Khalid Aziz
@ 2014-03-25 20:31 ` Eric W. Biederman
2014-03-25 21:37 ` Khalid Aziz
0 siblings, 1 reply; 67+ messages in thread
From: Eric W. Biederman @ 2014-03-25 20:31 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
linux-kernel, linux-doc
Khalid Aziz <khalid.aziz@oracle.com> writes:
> On 03/25/2014 12:59 PM, ebiederm@xmission.com wrote:
>> Khalid Aziz <khalid.aziz@oracle.com> writes:
>>
>>> This patch adds a way for a thread to request additional timeslice from
>>> the scheduler if it is about to be preempted, so it could complete any
>>> critical task it is in the middle of. .......
>>
>>
>> Let me see if I understand the problem. Your simulated application has
>> a ridiculous number of threads (1000) all contending for a single lock
>> with fairly long lock hold times between 600 and 20000 clocks assuming
>> no cache line misses. So 1000 threads contending for about 10usec or
>> 1/100 of a tick when HZ=1000. Giving you something like 1 chance in
>> 100 of being preempted while holding the lock. With 1000 threads
>> those sound like pretty bad odds.
>
> This problem does not happen because threads are holding the lock for too long,
> rather it happens when a thread does a large number of things in its loop and
> one small part of it requires it to hold a lock. So it holds the lock for a very
> short time but what can happen is thread is executing in non-critical section of
> its loop, it finally gets to the critical section just as its timeslice is about
> to end, it grabs the lock and is pre-empted right away. Now we start building a
> convoy of threads that want the same lock. This problem can be avoided if the
> locking thread could be given additional time to complete its critical section,
> release the lock and yield the processor if it indeed was granted amnesty by the
> scheduler.
I would dearly like to see the math that shows such a change will make
actually significantly change the propbabilities of this hitting. It
seems more likely that this will just be a way for threads to cheat and
get larger time slices.
>> Maybe if this was limited to a cooperating set of userspace
>> tasks/threads this might not be too bad. As this exists I have users
>> who would hunt me down with malicious intent if this code ever showed up
>> on our servers, because it would make life for every other application
>> on the server worse.
>>
>
> Yes, it is indeed limited to a cooperating set of userspace
> tasks/threads. Tasks/threads will explicitly choose to use this feature. It is a
> no-op for every one else.
It is absolutely not a no-op for me if my task can't be scheduled soon
enough because your task executed sched-prempt.
It means my latency goes up and the *random bad thing* will happen
because I missed my deadline because I was not scheduled fast enough.
>> or (b) limiting this to just a small
>> cooperating set of threads in a single cgroup.
>
> and that is almost what this patch does. It is not limited to a cgroup, rather
> to the tasks/threads that ask to use this feature.
Except you do not appear to be considering what could be scheduled in
your tasks place.
You allow any task to extend it's timeslice.
Which means I will get the question why does why does
really_important_job only miss it's latency guarantees when running on
the same box as sched_preempt_using_job?
Your change appears to have extremely difficult to debug non-local
effects.
Eric
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 20:31 ` Eric W. Biederman
@ 2014-03-25 21:37 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 21:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
linux-kernel, linux-doc
On 03/25/2014 02:31 PM, ebiederm@xmission.com wrote:
>> Yes, it is indeed limited to a cooperating set of userspace
>> tasks/threads. Tasks/threads will explicitly choose to use this feature. It is a
>> no-op for every one else.
>
> It is absolutely not a no-op for me if my task can't be scheduled soon
> enough because your task executed sched-prempt.
>
> It means my latency goes up and the *random bad thing* will happen
> because I missed my deadline because I was not scheduled fast enough.
>
This patch changes only CFS, so SCHED_RR and SCHED_FIFO are not affected
by this change. Does CFS ever guarantee that any task will be scheduled
fast enough to meet its critical deadline? Shouldn't a system that
requires guarantees to meet deadline use SCHED_FIFO or SCHED_RR? Even if
a task that has an important deadline to meet and is ready to run, it
can be denied processor time simply based upon which other tasks are
ready to run, how long they have run previously and what their
priorities are. I don't see how CFS can guarantee temporal determinism
other than every task will get a chance to run in a finite time period.
Please do correct me if my understanding of CFS is flawed.
Any task that asks for extension of its current timeslice will get
charged this additional time in p->se.vruntime and cause it to move to
the right on rbtree. This amounts to a task borrowing time from its
future runtime as opposed to being able to get more processor time than
others.
>>> or (b) limiting this to just a small
>>> cooperating set of threads in a single cgroup.
>>
>> and that is almost what this patch does. It is not limited to a cgroup, rather
>> to the tasks/threads that ask to use this feature.
>
> Except you do not appear to be considering what could be scheduled in
> your tasks place.
>
> You allow any task to extend it's timeslice.
Would you recommend a policy that determines which tasks are allowed to
extend their timeslice?
>
> Which means I will get the question why does why does
> really_important_job only miss it's latency guarantees when running on
> the same box as sched_preempt_using_job?
I would expect a system that requires latency guarantees to be designed
using SCHED_FIFO or SCHED_RR, not SCHED_OTHER.
Hope this makes sense.
Thanks,
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
` (3 preceding siblings ...)
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
@ 2014-03-25 23:01 ` Davidlohr Bueso
2014-03-25 23:29 ` Khalid Aziz
4 siblings, 1 reply; 67+ messages in thread
From: Davidlohr Bueso @ 2014-03-25 23:01 UTC (permalink / raw)
To: Khalid Aziz
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
venki, linux-kernel
On Mon, 2014-03-03 at 11:07 -0700, Khalid Aziz wrote:
> I am working on a feature that has been requested by database folks that
> helps with performance. Some of the oft executed database code uses
> mutexes to lock other threads out of a critical section. They often see
> a situation where a thread grabs the mutex, runs out of its timeslice
> and gets switched out which then causes another thread to run which
> tries to grab the same mutex, spins for a while and finally gives up.
> This can happen with multiple threads until original lock owner gets the
> CPU again and can complete executing its critical section. This queueing
> and subsequent CPU cycle wastage can be avoided if the locking thread
> could request to be granted an additional timeslice if its current
> timeslice runs out before it gives up the lock. Other operating systems
> have implemented this functionality and is used by databases as well as
> JVM. This functionality has been shown to improve performance by 3%-5%.
>
> I have implemented similar functionality for Linux. This patch adds a
> file /proc/<tgid>/task/<tid>/sched_preempt_delay for each thread.
> Writing 1 to this file causes CFS scheduler to grant additional time
> slice if the currently running process comes up for pre-emption. Writing
> to this file needs to be very quick operation, so I have implemented
> code to allow mmap'ing /proc/<tgid>/task/<tid>/sched_preempt_delay. This
> allows a userspace task to write this flag very quickly. Usage model is
> a thread mmaps this file during initialization. It then writes a 1 to
> the mmap'd file after it grabs the lock in its critical section where it
> wants immunity from pre-emption. It then writes 0 again to this file
> after it releases the lock and calls sched_yield() to give up the
> processor. I have also added a new field in scheduler statistics -
> nr_preempt_delayed, that counts the number of times a thread has been
> granted amnesty. Further details on using this functionality are in
> Documentation/scheduler/sched-preempt-delay.txt in the patch. This
> patch is based upon the work Venkatesh Pallipadi had done couple of
> years ago.
>
> Please provide feedback on this functionality and patch.
Good timing! The topic came up just yesterday in LSF/MM. This
functionality is on the wish list for both facebook and postgres.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [RFC] [PATCH] Pre-emption control for userspace
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
@ 2014-03-25 23:29 ` Khalid Aziz
0 siblings, 0 replies; 67+ messages in thread
From: Khalid Aziz @ 2014-03-25 23:29 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: tglx, mingo, hpa, peterz, akpm, andi.kleen, rob, viro, oleg,
venki, linux-kernel
On 03/25/2014 05:01 PM, Davidlohr Bueso wrote:
> Good timing! The topic came up just yesterday in LSF/MM. This
> functionality is on the wish list for both facebook and postgres.
>
Thanks for letting me know. I am glad to hear of others who need this
functionality. Did you happen to catch the names of folks at facebook
and postgres that are interested in this? I would like to work with them
to make sure what I do works for them as well.
--
Khalid
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v2] Pre-emption control for userspace
2014-03-25 18:59 ` Eric W. Biederman
2014-03-25 19:15 ` Khalid Aziz
@ 2014-03-26 6:03 ` Mike Galbraith
1 sibling, 0 replies; 67+ messages in thread
From: Mike Galbraith @ 2014-03-26 6:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Khalid Aziz, tglx, mingo, hpa, peterz, akpm, andi.kleen, rob,
viro, oleg, gnomes, riel, snorcht, dhowells, luto, daeseok.youn,
linux-kernel, linux-doc
On Tue, 2014-03-25 at 11:59 -0700, Eric W. Biederman wrote:
> So for the general horrible idea.
> Nacked-With-Extreme-Prejudice-by: "Eric W. Biederman" <ebiederm@xmission.com>
Goody. I was surprised Peter didn't make it instantly dead.
-Mike
^ permalink raw reply [flat|nested] 67+ messages in thread
end of thread, other threads:[~2014-03-26 6:03 UTC | newest]
Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29 ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44 ` Khalid Aziz
2014-03-04 18:38 ` Al Viro
2014-03-04 19:01 ` Khalid Aziz
2014-03-04 19:03 ` Oleg Nesterov
2014-03-04 20:14 ` Khalid Aziz
2014-03-05 14:38 ` Oleg Nesterov
2014-03-05 16:12 ` Oleg Nesterov
2014-03-05 17:10 ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39 ` Khalid Aziz
2014-03-04 22:23 ` One Thousand Gnomes
2014-03-04 22:44 ` Khalid Aziz
2014-03-05 0:39 ` Thomas Gleixner
2014-03-05 0:51 ` Andi Kleen
2014-03-05 11:10 ` Peter Zijlstra
2014-03-05 17:29 ` Khalid Aziz
2014-03-05 19:58 ` Khalid Aziz
2014-03-06 9:57 ` Peter Zijlstra
2014-03-06 16:08 ` Khalid Aziz
2014-03-06 11:14 ` Thomas Gleixner
2014-03-06 16:32 ` Khalid Aziz
2014-03-05 14:54 ` Oleg Nesterov
2014-03-05 15:56 ` Andi Kleen
2014-03-05 16:36 ` Oleg Nesterov
2014-03-05 17:22 ` Khalid Aziz
2014-03-05 23:13 ` David Lang
2014-03-05 23:48 ` Khalid Aziz
2014-03-05 23:56 ` H. Peter Anvin
2014-03-06 0:02 ` Khalid Aziz
2014-03-06 0:13 ` H. Peter Anvin
2014-03-05 23:59 ` David Lang
2014-03-06 0:17 ` Khalid Aziz
2014-03-06 0:36 ` David Lang
2014-03-06 1:22 ` Khalid Aziz
2014-03-06 14:23 ` David Lang
2014-03-06 12:13 ` Kevin Easton
2014-03-06 13:59 ` Peter Zijlstra
2014-03-06 22:41 ` Andi Kleen
2014-03-06 14:25 ` David Lang
2014-03-06 16:12 ` Khalid Aziz
2014-03-06 13:24 ` Rasmus Villemoes
2014-03-06 13:34 ` Peter Zijlstra
2014-03-06 13:45 ` Rasmus Villemoes
2014-03-06 14:02 ` Peter Zijlstra
2014-03-06 14:33 ` Thomas Gleixner
2014-03-06 14:34 ` H. Peter Anvin
2014-03-06 14:04 ` Thomas Gleixner
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44 ` Andrew Morton
2014-03-25 17:56 ` Khalid Aziz
2014-03-25 18:14 ` Andrew Morton
2014-03-25 17:46 ` Oleg Nesterov
2014-03-25 17:59 ` Khalid Aziz
2014-03-25 18:20 ` Andi Kleen
2014-03-25 18:47 ` Khalid Aziz
2014-03-25 19:47 ` Andi Kleen
2014-03-25 18:59 ` Eric W. Biederman
2014-03-25 19:15 ` Khalid Aziz
2014-03-25 20:31 ` Eric W. Biederman
2014-03-25 21:37 ` Khalid Aziz
2014-03-26 6:03 ` Mike Galbraith
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
2014-03-25 23:29 ` Khalid Aziz
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.