linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	peterz@infradead.org, andi.kleen@intel.com, rob@landley.net,
	viro@zeniv.linux.org.uk, oleg@redhat.com,
	gnomes@lxorguk.ukuu.org.uk, riel@redhat.com, snorcht@gmail.com,
	dhowells@redhat.com, luto@amacapital.net, daeseok.youn@gmail.com,
	ebiederm@xmission.com, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH v2] Pre-emption control for userspace
Date: Tue, 25 Mar 2014 10:44:10 -0700	[thread overview]
Message-ID: <20140325104410.25e06c4d.akpm@linux-foundation.org> (raw)
In-Reply-To: <1395767870-28053-1-git-send-email-khalid.aziz@oracle.com>

On Tue, 25 Mar 2014 11:17:50 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> 
> This patch adds a way for a thread to request additional timeslice from
> the scheduler if it is about to be preempted, so it could complete any
> critical task it is in the middle of. This functionality helps with
> performance on databases and has been used for many years on other OSs
> by the databases. This functionality helps in situation where a thread
> acquires a lock before performing a critical operation on the database,
> happens to get preempted before it completes its task and releases the
> lock.  This lock causes all other threads that also acquire the same
> lock to perform their critical operation on the database to start
> queueing up and causing large number of context switches. This queueing
> problem can be avoided if the thread that acquires lock first could
> request scheduler to grant it an additional timeslice once it enters its
> critical section and hence allow it to complete its critical sectiona
> without causing queueing problem. If critical section completes before
> the thread is due for preemption, the thread can simply desassert its
> request. A thread sends the scheduler this request by setting a flag in
> a memory location it has shared with the kernel.  Kernel uses bytes in
> the same memory location to let the thread know when its request for
> amnesty from preemption has been granted. Thread should yield the
> processor at the end of its critical section if it was granted amnesty
> to play nice with other threads. If thread fails to yield processor, it
> gets penalized by having its next amnesty request turned down by
> scheduler.  Documentation file included in this patch contains further
> details on how to use this functionality and conditions associated with
> its use. This patch also adds a new field in scheduler statistics which
> keeps track of how many times was a thread granted amnesty from
> preemption. This feature and its usage are documented in
> Documentation/scheduler/sched-preempt-delay.txt and this patch includes
> a test for this feature under tools/testing/selftests/preempt-delay

What a long paragraph ;)

The feature makes sense and sounds useful, but I'd like to see some
real world performance testing results so we can understand what the
benefit will be to our users?

>
> ...
>
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +static int
> +tid_preempt_delay_show(struct seq_file *m, void *v)
> +{
> +	struct inode *inode = m->private;
> +	struct task_struct *task = get_proc_task(inode);
> +	unsigned char *delay_req;
> +
> +	if (!task)
> +		return -ENOENT;
> +
> +	delay_req = (unsigned char *)task->sched_preempt_delay.delay_req;
> +	seq_printf(m, "0x%-p\n", delay_req);
> +
> +	put_task_struct(task);
> +	return 0;
> +}
> +
> +static ssize_t
> +tid_preempt_delay_write(struct file *file, const char __user *buf,
> +			  size_t count, loff_t *offset)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct task_struct *task = get_proc_task(inode);
> +	u32 __user *delay_req;
> +	int retval;
> +
> +	if (!task) {
> +		retval = -ENOENT;
> +		goto out;
> +	}
> +
> +	/*
> +	 * A thread can write only to its corresponding preempt_delay
> +	 * proc file
> +	 */
> +	if (current != task) {
> +		retval =  -EPERM;
> +		goto out;
> +	}
> +
> +	delay_req = *(u32 __user **)buf;
> +
> +	/*
> +	 * Do not allow write if pointer is currently set
> +	 */
> +	if (task->sched_preempt_delay.delay_req && (delay_req != NULL)) {
> +		retval = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Validate the pointer.
> +	 */
> +	if (unlikely(!access_ok(rw, delay_req, sizeof(u32)))) {
> +		retval = -EFAULT;
> +		goto out;
> +	}
> +
> +	task->sched_preempt_delay.delay_req = delay_req;
> +
> +	/* zero out flags */
> +	put_user(0, delay_req);
> +
> +	retval = count;
> +
> +out:
> +	put_task_struct(task);
> +	return retval;
> +}

So the procfs file is written in binary format and is read back in
ascii format.  Seems odd.

Perhaps this should all be done as a new syscall rather than some
procfs thing.

>
> ...
>
> @@ -1250,6 +1251,13 @@ struct task_struct {
>  	/* Revert to default priority/policy when forking */
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	struct preempt_delay {
> +		u32 __user *delay_req;		/* delay request flag pointer */
> +		unsigned char delay_granted:1;	/* currently in delay */
> +		unsigned char yield_penalty:1;	/* failure to yield penalty */
> +	} sched_preempt_delay;

The problem with bitfields is that a write to one bitfield can corrupt
a concurrent write to the other one.  So it's your responsibility to
provide locking and/or to describe how this race is avoided.  A comment
here in the definition would be a suitable way of addressing this.

> +#endif
>  
>  	pid_t pid;
>  	pid_t tgid;
>
> ...
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4055,6 +4055,14 @@ SYSCALL_DEFINE0(sched_yield)
>  {
>  	struct rq *rq = this_rq_lock();
>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules

"playing"

> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
>  	schedstat_inc(rq, yld_count);
>  	current->sched_class->yield_task(rq);
>  
> ...
>
> +static void
> +delay_resched_task(struct task_struct *curr)
> +{
> +	struct sched_entity *se;
> +	int cpu = task_cpu(curr);
> +	u32 __user *delay_req;
> +	unsigned int delay_req_flag;
> +	unsigned char *delay_flag;
> +
> +	/*
> +	 * Check if task is using pre-emption delay feature. If address
> +	 * for preemption delay request flag is not set, this task is
> +	 * not using preemption delay feature, we can reschedule without
> +	 * any delay
> +	 */
> +	delay_req = curr->sched_preempt_delay.delay_req;
> +
> +	if ((delay_req == NULL) || (cpu != smp_processor_id()))

Presumably we don't get "smp_processor_id() used in preemptible code"
warnings here, so called-with-preempt-disabled is a secret prerequisite
for delay_resched_task().

> +		goto resched_now;
> +
> +	/*
> +	 * Pre-emption delay will  be granted only once. If this task
> +	 * has already been granted delay, rechedule now
> +	 */
> +	if (curr->sched_preempt_delay.delay_granted) {
> +		curr->sched_preempt_delay.delay_granted = 0;
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there
> +	 * is a page fault accessing this flag in userspace, that means
> +	 * userspace has not touched this flag recently and we can
> +	 * assume no preemption delay is needed.
> +	 *
> +	 * If task is not requesting additional timeslice, resched now
> +	 */
> +	if (delay_req) {
> +		int ret;
> +
> +		pagefault_disable();
> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> +				sizeof(u32));
> +		pagefault_enable();

This all looks rather hacky and unneccesary.  Can't we somehow use
plain old get_user() and avoid such fuss?

> +		delay_flag = &delay_req_flag;
> +		if (ret || !delay_flag[0])
> +			goto resched_now;
> +	} else {
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Current thread has requested preemption delay and has not
> +	 * been granted an extension yet. If this thread failed to yield
> +	 * processor after being granted amnesty last time, penalize it
> +	 * by not granting this delay request, otherwise give it an extra
> +	 * timeslice.
> +	 */
> +	if (curr->sched_preempt_delay.yield_penalty) {
> +		curr->sched_preempt_delay.yield_penalty = 0;
> +		goto resched_now;
> +	}
> +
> +	se = &curr->se;
> +	curr->sched_preempt_delay.delay_granted = 1;
> +
> +	/*
> +	 * Set the penalty flag for failing to yield the processor after
> +	 * being granted immunity. This flag will be cleared in
> +	 * sched_yield() if the thread indeed calls sched_yield
> +	 */
> +	curr->sched_preempt_delay.yield_penalty = 1;
> +
> +	/*
> +	 * Let the thread know it got amnesty and it should call
> +	 * sched_yield() when it is done to avoid penalty next time
> +	 * it wants amnesty. We need to write to userspace location.
> +	 * Since we just read from this location, chances are extremley
> +	 * low we might page fault. If we do page fault, we will ignore
> +	 * it and accept the cost of failed write in form of unnecessary
> +	 * penalty for userspace task for not yielding processor.
> +	 * This is a highly unlikely scenario.
> +	 */
> +	delay_flag[0] = 0;
> +	delay_flag[1] = 1;
> +	pagefault_disable();
> +	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
> +	pagefault_enable();

and put_user() here.

> +	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
> +	return;
> +
> +resched_now:
> +	resched_task(curr);
> +}
> +#else
> +#define delay_resched_task(curr) resched_task(curr)

This could have been implemented in C...

> +#endif /* CONFIG_SCHED_PREEMPT_DELAY */
> +
>
> ...
>

  reply	other threads:[~2014-03-25 17:44 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-03 18:07 [RFC] [PATCH] Pre-emption control for userspace Khalid Aziz
2014-03-03 21:51 ` Davidlohr Bueso
2014-03-03 23:29   ` Khalid Aziz
2014-03-04 13:56 ` Oleg Nesterov
2014-03-04 17:44   ` Khalid Aziz
2014-03-04 18:38     ` Al Viro
2014-03-04 19:01       ` Khalid Aziz
2014-03-04 19:03     ` Oleg Nesterov
2014-03-04 20:14       ` Khalid Aziz
2014-03-05 14:38         ` Oleg Nesterov
2014-03-05 16:12           ` Oleg Nesterov
2014-03-05 17:10             ` Khalid Aziz
2014-03-04 21:12 ` H. Peter Anvin
2014-03-04 21:39   ` Khalid Aziz
2014-03-04 22:23     ` One Thousand Gnomes
2014-03-04 22:44       ` Khalid Aziz
2014-03-05  0:39         ` Thomas Gleixner
2014-03-05  0:51           ` Andi Kleen
2014-03-05 11:10             ` Peter Zijlstra
2014-03-05 17:29               ` Khalid Aziz
2014-03-05 19:58               ` Khalid Aziz
2014-03-06  9:57                 ` Peter Zijlstra
2014-03-06 16:08                   ` Khalid Aziz
2014-03-06 11:14                 ` Thomas Gleixner
2014-03-06 16:32                   ` Khalid Aziz
2014-03-05 14:54             ` Oleg Nesterov
2014-03-05 15:56               ` Andi Kleen
2014-03-05 16:36                 ` Oleg Nesterov
2014-03-05 17:22                   ` Khalid Aziz
2014-03-05 23:13                     ` David Lang
2014-03-05 23:48                       ` Khalid Aziz
2014-03-05 23:56                         ` H. Peter Anvin
2014-03-06  0:02                           ` Khalid Aziz
2014-03-06  0:13                             ` H. Peter Anvin
2014-03-05 23:59                         ` David Lang
2014-03-06  0:17                           ` Khalid Aziz
2014-03-06  0:36                             ` David Lang
2014-03-06  1:22                               ` Khalid Aziz
2014-03-06 14:23                                 ` David Lang
2014-03-06 12:13             ` Kevin Easton
2014-03-06 13:59               ` Peter Zijlstra
2014-03-06 22:41                 ` Andi Kleen
2014-03-06 14:25               ` David Lang
2014-03-06 16:12                 ` Khalid Aziz
2014-03-06 13:24   ` Rasmus Villemoes
2014-03-06 13:34     ` Peter Zijlstra
2014-03-06 13:45       ` Rasmus Villemoes
2014-03-06 14:02         ` Peter Zijlstra
2014-03-06 14:33           ` Thomas Gleixner
2014-03-06 14:34             ` H. Peter Anvin
2014-03-06 14:04         ` Thomas Gleixner
2014-03-25 17:17 ` [PATCH v2] " Khalid Aziz
2014-03-25 17:44   ` Andrew Morton [this message]
2014-03-25 17:56     ` Khalid Aziz
2014-03-25 18:14       ` Andrew Morton
2014-03-25 17:46   ` Oleg Nesterov
2014-03-25 17:59     ` Khalid Aziz
2014-03-25 18:20   ` Andi Kleen
2014-03-25 18:47     ` Khalid Aziz
2014-03-25 19:47       ` Andi Kleen
2014-03-25 18:59   ` Eric W. Biederman
2014-03-25 19:15     ` Khalid Aziz
2014-03-25 20:31       ` Eric W. Biederman
2014-03-25 21:37         ` Khalid Aziz
2014-03-26  6:03     ` Mike Galbraith
2014-03-25 23:01 ` [RFC] [PATCH] " Davidlohr Bueso
2014-03-25 23:29   ` Khalid Aziz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140325104410.25e06c4d.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=andi.kleen@intel.com \
    --cc=daeseok.youn@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=gnomes@lxorguk.ukuu.org.uk \
    --cc=hpa@zytor.com \
    --cc=khalid.aziz@oracle.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rob@landley.net \
    --cc=snorcht@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).