From: Andrew Morton <akpm@linux-foundation.org>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
peterz@infradead.org, andi.kleen@intel.com, rob@landley.net,
viro@zeniv.linux.org.uk, oleg@redhat.com,
gnomes@lxorguk.ukuu.org.uk, riel@redhat.com, snorcht@gmail.com,
dhowells@redhat.com, luto@amacapital.net, daeseok.youn@gmail.com,
ebiederm@xmission.com, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org
Subject: Re: [PATCH v2] Pre-emption control for userspace
Date: Tue, 25 Mar 2014 10:44:10 -0700 [thread overview]
Message-ID: <20140325104410.25e06c4d.akpm@linux-foundation.org> (raw)
In-Reply-To: <1395767870-28053-1-git-send-email-khalid.aziz@oracle.com>
On Tue, 25 Mar 2014 11:17:50 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock. This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel. Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler. Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay
What a long paragraph ;)
The feature makes sense and sounds useful, but I'd like to see some
real world performance testing results so we can understand what the
benefit will be to our users?
>
> ...
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +static int
> +tid_preempt_delay_show(struct seq_file *m, void *v)
> +{
> + struct inode *inode = m->private;
> + struct task_struct *task = get_proc_task(inode);
> + unsigned char *delay_req;
> +
> + if (!task)
> + return -ENOENT;
> +
> + delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
> + seq_printf(m, "0x%-p\n", delay_req);
> +
> + put_task_struct(task);
> + return 0;
> +}
> +
> +static ssize_t
> +tid_preempt_delay_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *offset)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + u32 __user *delay_req;
> + int retval;
> +
> + if (!task) {
> + retval = -ENOENT;
> + goto out;
> + }
> +
> + /*
> + * A thread can write only to its corresponding preempt_delay
> + * proc file
> + */
> + if (current != task) {
> + retval = -EPERM;
> + goto out;
> + }
> +
> + delay_req = *(u32 __user **)buf;
> +
> + /*
> + * Do not allow write if pointer is currently set
> + */
> + if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
> + retval = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Validate the pointer.
> + */
> + if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
> + retval = -EFAULT;
> + goto out;
> + }
> +
> + task->sched_preempt_delay.delay_req = delay_req;
> +
> + /* zero out flags */
> + put_user(0, delay_req);
> +
> + retval = count;
> +
> +out:
> + put_task_struct(task);
> + return retval;
> +}
So the procfs file is written in binary format and is read back in
ascii format. Seems odd.
Perhaps this should all be done as a new syscall rather than some
procfs thing.
>
> ...
>
> @@ -1250,6 +1251,13 @@ struct task_struct {
> /* Revert to default priority/policy when forking */
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> + struct preempt_delay {
> + u32 __user *delay_req; /* delay request flag pointer */
> + unsigned char delay_granted:1; /* currently in delay */
> + unsigned char yield_penalty:1; /* failure to yield penalty */
> + } sched_preempt_delay;
The problem with bitfields is that a write to one bitfield can corrupt
a concurrent write to the other one. So it's your responsibility to
provide locking and/or to describe how this race is avoided. A comment
here in the definition would be a suitable way of addressing this.
> +#endif
>
> pid_t pid;
> pid_t tgid;
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
> {
> struct rq *rq = this_rq_lock();
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> + /*
> + * Clear the penalty flag for current task to reward it for
> + * palying by the rules
"playing"
> + */
> + current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
> schedstat_inc(rq, yld_count);
> current->sched_class->yield_task(rq);
>
> ...
>
> +static void
> +delay_resched_task(struct task_struct *curr)
> +{
> + struct sched_entity *se;
> + int cpu = task_cpu(curr);
> + u32 __user *delay_req;
> + unsigned int delay_req_flag;
> + unsigned char *delay_flag;
> +
> + /*
> + * Check if task is using pre-emption delay feature. If address
> + * for preemption delay request flag is not set, this task is
> + * not using preemption delay feature, we can reschedule without
> + * any delay
> + */
> + delay_req = curr->sched_preempt_delay.delay_req;
> +
> + if ((delay_req == NULL) || (cpu != smp_processor_id()))
Presumably we don't get "smp_processor_id() used in preemptible code"
warnings here, so called-with-preempt-disabled is a secret prerequisite
for delay_resched_task().
> + goto resched_now;
> +
> + /*
> + * Pre-emption delay will be granted only once. If this task
> + * has already been granted delay, rechedule now
> + */
> + if (curr->sched_preempt_delay.delay_granted) {
> + curr->sched_preempt_delay.delay_granted = 0;
> + goto resched_now;
> + }
> +
> + /*
> + * Get the value of preemption delay request flag from userspace.
> + * Task had already passed us the address where the flag is stored
> + * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> + * futex, leverage the futex code here to read the flag. If there
> + * is a page fault accessing this flag in userspace, that means
> + * userspace has not touched this flag recently and we can
> + * assume no preemption delay is needed.
> + *
> + * If task is not requesting additional timeslice, resched now
> + */
> + if (delay_req) {
> + int ret;
> +
> + pagefault_disable();
> + ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> + sizeof(u32));
> + pagefault_enable();
This all looks rather hacky and unneccesary. Can't we somehow use
plain old get_user() and avoid such fuss?
> + delay_flag = &delay_req_flag;
> + if (ret || !delay_flag[0])
> + goto resched_now;
> + } else {
> + goto resched_now;
> + }
> +
> + /*
> + * Current thread has requested preemption delay and has not
> + * been granted an extension yet. If this thread failed to yield
> + * processor after being granted amnesty last time, penalize it
> + * by not granting this delay request, otherwise give it an extra
> + * timeslice.
> + */
> + if (curr->sched_preempt_delay.yield_penalty) {
> + curr->sched_preempt_delay.yield_penalty = 0;
> + goto resched_now;
> + }
> +
> + se = &curr->se;
> + curr->sched_preempt_delay.delay_granted = 1;
> +
> + /*
> + * Set the penalty flag for failing to yield the processor after
> + * being granted immunity. This flag will be cleared in
> + * sched_yield() if the thread indeed calls sched_yield
> + */
> + curr->sched_preempt_delay.yield_penalty = 1;
> +
> + /*
> + * Let the thread know it got amnesty and it should call
> + * sched_yield() when it is done to avoid penalty next time
> + * it wants amnesty. We need to write to userspace location.
> + * Since we just read from this location, chances are extremley
> + * low we might page fault. If we do page fault, we will ignore
> + * it and accept the cost of failed write in form of unnecessary
> + * penalty for userspace task for not yielding processor.
> + * This is a highly unlikely scenario.
> + */
> + delay_flag[0] = 0;
> + delay_flag[1] = 1;
> + pagefault_disable();
> + __copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
> + pagefault_enable();
and put_user() here.
> + schedstat_inc(curr, se.statistics.nr_preempt_delayed);
> + return;
> +
> +resched_now:
> + resched_task(curr);
> +}
> +#else
> +#define delay_resched_task(curr) resched_task(curr)
This could have been implemented in C...
> +#endif /* CONFIG_SCHED_PREEMPT_DELAY */
> +
>
> ...
>
next prev parent reply other threads:[~2014-03-25 17:44 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29 ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44 ` Khalid Aziz
2014-03-04 18:38 ` Al Viro
2014-03-04 19:01 ` Khalid Aziz
2014-03-04 19:03 ` Oleg Nesterov
2014-03-04 20:14 ` Khalid Aziz
2014-03-05 14:38 ` Oleg Nesterov
2014-03-05 16:12 ` Oleg Nesterov
2014-03-05 17:10 ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39 ` Khalid Aziz
2014-03-04 22:23 ` One Thousand Gnomes
2014-03-04 22:44 ` Khalid Aziz
2014-03-05 0:39 ` Thomas Gleixner
2014-03-05 0:51 ` Andi Kleen
2014-03-05 11:10 ` Peter Zijlstra
2014-03-05 17:29 ` Khalid Aziz
2014-03-05 19:58 ` Khalid Aziz
2014-03-06 9:57 ` Peter Zijlstra
2014-03-06 16:08 ` Khalid Aziz
2014-03-06 11:14 ` Thomas Gleixner
2014-03-06 16:32 ` Khalid Aziz
2014-03-05 14:54 ` Oleg Nesterov
2014-03-05 15:56 ` Andi Kleen
2014-03-05 16:36 ` Oleg Nesterov
2014-03-05 17:22 ` Khalid Aziz
2014-03-05 23:13 ` David Lang
2014-03-05 23:48 ` Khalid Aziz
2014-03-05 23:56 ` H. Peter Anvin
2014-03-06 0:02 ` Khalid Aziz
2014-03-06 0:13 ` H. Peter Anvin
2014-03-05 23:59 ` David Lang
2014-03-06 0:17 ` Khalid Aziz
2014-03-06 0:36 ` David Lang
2014-03-06 1:22 ` Khalid Aziz
2014-03-06 14:23 ` David Lang
2014-03-06 12:13 ` Kevin Easton
2014-03-06 13:59 ` Peter Zijlstra
2014-03-06 22:41 ` Andi Kleen
2014-03-06 14:25 ` David Lang
2014-03-06 16:12 ` Khalid Aziz
2014-03-06 13:24 ` Rasmus Villemoes
2014-03-06 13:34 ` Peter Zijlstra
2014-03-06 13:45 ` Rasmus Villemoes
2014-03-06 14:02 ` Peter Zijlstra
2014-03-06 14:33 ` Thomas Gleixner
2014-03-06 14:34 ` H. Peter Anvin
2014-03-06 14:04 ` Thomas Gleixner
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44 ` Andrew Morton [this message]
2014-03-25 17:56 ` Khalid Aziz
2014-03-25 18:14 ` Andrew Morton
2014-03-25 17:46 ` Oleg Nesterov
2014-03-25 17:59 ` Khalid Aziz
2014-03-25 18:20 ` Andi Kleen
2014-03-25 18:47 ` Khalid Aziz
2014-03-25 19:47 ` Andi Kleen
2014-03-25 18:59 ` Eric W. Biederman
2014-03-25 19:15 ` Khalid Aziz
2014-03-25 20:31 ` Eric W. Biederman
2014-03-25 21:37 ` Khalid Aziz
2014-03-26 6:03 ` Mike Galbraith
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
2014-03-25 23:29 ` Khalid Aziz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140325104410.25e06c4d.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=andi.kleen@intel.com \
--cc=daeseok.youn@gmail.com \
--cc=dhowells@redhat.com \
--cc=ebiederm@xmission.com \
--cc=gnomes@lxorguk.ukuu.org.uk \
--cc=hpa@zytor.com \
--cc=khalid.aziz@oracle.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=rob@landley.net \
--cc=snorcht@gmail.com \
--cc=tglx@linutronix.de \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).