* [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
@ 2014-11-24 20:56 Khalid Aziz
  2014-11-24 22:43 ` Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Khalid Aziz @ 2014-11-24 20:56 UTC (permalink / raw)
  To: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: Khalid Aziz, linux-kernel, linux-doc, linux-api

sched/fair: Add advisory flag for borrowing a timeslice

This patch adds a way for a task to request to borrow one timeslice
from the future if it is about to be preempted, so it can delay
preemption and complete whatever critical section it is in the middle
of.

This feature helps with database performance and has been available
for many years on other OSes for use by databases. It helps in
situations where a task acquires a lock before performing a critical
operation on the database and happens to get preempted before it
completes the operation. While the lock is held, all other tasks that
need the same lock to perform their critical operations start queueing
up, causing a large number of context switches. This queueing problem
can be avoided if the task that acquires the lock first can ask the
scheduler to let it borrow one timeslice when it enters its critical
section, allowing it to complete the critical section without causing
the queueing problem. If the critical section completes before the
task is due for preemption, the task simply deasserts its request. A
task sends the scheduler this request by setting a flag in a memory
location it has shared with the kernel. The kernel uses a byte in the
same memory location to let the task know when its request for amnesty
from preemption has been granted.

These rules apply to the use of this feature:

- A request to borrow a timeslice is not guaranteed to be honored.
- If the task is allowed to borrow, the kernel will inform the task
  of this. When this happens, the task must yield the processor as
  soon as it completes its critical section.
- If the task fails to yield the processor after being allowed to
  borrow, the scheduler penalizes it by forcing it to skip its next
  time slot.
- The task is charged for the borrowed timeslice as additional
  accumulated run time. This pushes it further down in consideration
  for the next task to run.

This feature was tested with a TPC-C workload. TPC-C shows a 3%
improvement in throughput when using this feature, which is a
significant gain for this benchmark.

A new sysctl tunable kernel.preempt_delay_available enables this
feature at run time. The kernel boots up with this feature disabled
by default.

The documentation file included in this patch contains details on how
to use this feature and the conditions associated with its use. This
patch also adds a new field to the scheduler statistics which keeps
track of how many times a task was granted amnesty from preemption.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---

With this new version of the patch, the kernel does not enable
preemption delay by default. The feature must be turned on using the
sysctl tunable kernel.preempt_delay_available. With this change, there
are now two ways to eliminate the impact of this feature on systems
that do not intend to use it and are sensitive to the scheduling
delays it may cause: the feature can be configured out of custom-built
kernels, and for pre-compiled kernels where it has been configured in,
it stays off until enabled through the sysctl tunable.

Changelog:
v3:
	- Use prctl() syscall to give kernel the location for shared flag 
	  instead of using a proc file.
	- Disabled this feature by default on a newly booted kernel and
	  added a sysctl tunable to enable/disable it at runtime.
v2:
	- Replaced mmap operation with a more memory efficient futex
	  like communication between userspace and kernel
	- Added a flag to let userspace know if it was granted amnesty
	- Added a penalty for tasks failing to yield CPU when they
	  are granted amnesty from pre-emption

v1:
	- Initial RFC patch with mmap for communication between userspace
	  and kernel

 Documentation/scheduler/sched-preempt-delay.txt | 117 ++++++++++++++++++++++++
 arch/x86/Kconfig                                |  12 +++
 include/linux/sched.h                           |  15 +++
 include/linux/sched/sysctl.h                    |   4 +
 include/uapi/linux/prctl.h                      |   3 +
 kernel/fork.c                                   |   5 +
 kernel/sched/core.c                             |   8 ++
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             | 115 ++++++++++++++++++++++-
 kernel/sys.c                                    |  43 +++++++++
 kernel/sysctl.c                                 |   9 ++
 11 files changed, 329 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..07dd699
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,117 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section is preempted by the scheduler in the middle of its critical
+section because its timeslice is up, the scheduler ends up scheduling
+other threads which immediately try to grab the lock to enter the
+critical section. This only results in lots of context switches as
+tasks wake up and go to sleep immediately again. If, on the other hand, the original
+task were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. The preemption delay feature allows a task
+to request that the scheduler grant it one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is compiled into the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in the kernel configuration. By default, the
+kernel boots up with this feature disabled. Enable it using the sysctl
+tunable kernel.preempt_delay_available. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. This location must be
+aligned to a 4-byte boundary. It first gives the kernel the address of
+this memory location by making a prctl() system call with the
+PR_SET_PREEMPT_DELAY option. The location is interpreted as a sequence of 4 bytes:
+
+	byte[0] = flag to request preemption delay
+	byte[1] = flag from kernel indicating preemption delay was granted
+	byte[2] = reserved for future use
+	byte[3] = reserved for future use
+
+A task requests a preemption delay by writing a non-zero value to the
+first byte. The scheduler checks this value before preempting the task.
+The scheduler can choose to grant one, and only one, additional
+timeslice to the task for each delay request, but the delay is not
+guaranteed. If the scheduler does grant an additional timeslice, it
+sets the flag in the second byte. Upon completion of the section of
+code where the task wants preemption delay, the task should check the
+second byte. If the flag in the second byte is set, it should clear
+the flag and call sched_yield() so as to not hog the processor. If a
+thread was granted an additional timeslice and fails to call
+sched_yield(), the scheduler will penalize it by denying its next
+request for an additional timeslice. The following sample code illustrates how to use this feature:
+
+int main()
+{
+	unsigned char buf[256];
+	union {
+		struct {
+			unsigned char nopreempt;
+			unsigned char yield;
+			unsigned char rsvd[2];
+		} flags;
+		uint32_t preempt_delay;
+	} delay;
+
+	delay.preempt_delay = 0;
+
+	/* Tell kernel where the flag lives */
+	prctl(PR_SET_PREEMPT_DELAY, &delay);
+
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		delay.flags.nopreempt = 1;
+		/*
+		 * Obtain lock for critical section
+		 */
+		/*
+		 * critical section
+		 */
+		/*
+		 * Release lock for critical section
+		 */
+		delay.flags.nopreempt = 0;
+		/* Give the CPU up if required */
+		if (delay.flags.yield) {
+			delay.flags.yield = 0;
+			sched_yield();
+		}
+		/* do some more work */
+	}
+	/*
+	 * Tell kernel we are done asking for preemption delay
+	 */
+	prctl(PR_SET_PREEMPT_DELAY, NULL);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+The preemption delay feature adds a new field to scheduler statistics:
+nr_preempt_delayed. This is a per-thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested it. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. The location of the shared flag can be set using prctl() only once.
+   To register a new memory address, the previous address must first
+   be cleared by writing NULL. Each new memory address requires
+   validation in the kernel and an update of pointers. Changing this
+   address too often would create too much overhead.
+
+2. If the location of the shared flag is not aligned to a 4-byte
+   boundary, prctl() will fail with EFAULT.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a67..3dac536 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -852,6 +852,18 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	depends on PROC_FS
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by setting a flag in a memory location, after
+	  sharing the address of this location with the kernel using the
+	  PR_SET_PREEMPT_DELAY prctl() call. See
+	  Documentation/scheduler/sched-preempt-delay.txt for details.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..f5150e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1324,6 +1325,13 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay {
+		u32 __user *delay_req;		/* delay request flag pointer */
+		unsigned char delay_granted;	/* currently in delay */
+		unsigned char yield_penalty;	/* failure to yield penalty */
+	} sched_preempt_delay;
+#endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
@@ -2179,6 +2187,13 @@ extern u64 scheduler_tick_max_deferment(void);
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
+#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
+extern void sched_preempt_delay_show(struct seq_file *m,
+					struct task_struct *task);
+extern void sched_preempt_delay_set(struct task_struct *task,
+					unsigned char *val);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..516f74e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -107,4 +107,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+extern int sysctl_preempt_delay_available;
+#endif
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..ecfd2cd 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+#define PR_SET_PREEMPT_DELAY	43
+#define PR_GET_PREEMPT_DELAY	44
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..7f0d843 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
 			init_completion(&vfork);
 			get_task_struct(p);
 		}
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+		p->sched_preempt_delay.delay_req = NULL;
+		p->sched_preempt_delay.delay_granted = 0;
+		p->sched_preempt_delay.yield_penalty = 0;
+#endif
 
 		wake_up_new_task(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..38cb515 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4230,6 +4230,14 @@ SYSCALL_DEFINE0(sched_yield)
 {
 	struct rq *rq = this_rq_lock();
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	/*
+	 * Clear the penalty flag for the current task to reward it for
+	 * playing by the rules
+	 */
+	current->sched_preempt_delay.yield_penalty = 0;
+#endif
+
 	schedstat_inc(rq, yld_count);
 	current->sched_class->yield_task(rq);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..618d2ac 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -597,6 +597,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34baa60..58f6e1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -428,6 +428,115 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+/*
+ * delay_resched_rq(): Check if the task about to be preempted has
+ *	requested an additional timeslice. If it has, grant it one
+ *	additional timeslice.
+ */
+static void
+delay_resched_rq(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_entity *se;
+	int cpu = task_cpu(curr);
+	u32 __user *delay_req;
+	unsigned int delay_req_flag;
+	unsigned char *delay_flag;
+
+	/*
+	 * Check if the task is using the preemption delay feature. If the
+	 * address of the preemption delay request flag is not set, this
+	 * task is not using the feature and we can reschedule without
+	 * any delay.
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+
+	if ((delay_req == NULL) || (cpu != smp_processor_id()))
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will be granted only once. If this task
+	 * has already been granted a delay, reschedule now
+	 */
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		goto resched_now;
+	}
+
+	/*
+	 * Get the value of preemption delay request flag from userspace.
+	 * The task already passed us the address where the flag is stored
+	 * in userspace. Read it the way get_futex_value_locked() reads a
+	 * PROCESS_PRIVATE futex value. If there
+	 * is a page fault accessing this flag in userspace, that means
+	 * userspace has not touched this flag recently and we can
+	 * assume no preemption delay is needed.
+	 *
+	 * If task is not requesting additional timeslice, resched now
+	 */
+	if (delay_req) {
+		int ret;
+
+		pagefault_disable();
+		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
+				sizeof(u32));
+		pagefault_enable();
+		delay_flag = &delay_req_flag;
+		if (ret || !delay_flag[0])
+			goto resched_now;
+	} else {
+		goto resched_now;
+	}
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. If this thread failed to yield
+	 * processor after being granted amnesty last time, penalize it
+	 * by not granting this delay request, otherwise give it an extra
+	 * timeslice.
+	 */
+	if (curr->sched_preempt_delay.yield_penalty) {
+		curr->sched_preempt_delay.yield_penalty = 0;
+		goto resched_now;
+	}
+
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+
+	/*
+	 * Set the penalty flag for failing to yield the processor after
+	 * being granted immunity. This flag will be cleared in
+	 * sched_yield() if the thread indeed calls sched_yield
+	 */
+	curr->sched_preempt_delay.yield_penalty = 1;
+
+	/*
+	 * Let the thread know it got amnesty and that it should call
+	 * sched_yield() when it is done, to avoid a penalty the next time
+	 * it wants amnesty. We need to write to the userspace location.
+	 * Since we just read from this location, chances are extremely
+	 * low that we will page fault. If we do page fault, we ignore it
+	 * and accept the cost of the failed write in the form of an
+	 * unnecessary penalty for the userspace task for not yielding
+	 * the processor. This is a highly unlikely scenario.
+	 */
+	delay_flag[0] = 0;
+	delay_flag[1] = 1;
+	pagefault_disable();
+	__copy_to_user_inatomic(delay_req, &delay_req_flag, sizeof(u32));
+	pagefault_enable();
+
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	return;
+
+resched_now:
+	resched_curr(rq);
+}
+#else
+#define delay_resched_rq(rq) resched_curr(rq)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2939,7 +3048,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2963,7 +3072,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 }
 
 static void
@@ -4780,7 +4889,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_curr(rq);
+	delay_resched_rq(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..9edf572 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2025,6 +2025,36 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 }
 #endif
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+int sysctl_preempt_delay_available;
+
+static int
+preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
+{
+	/*
+	 * Do not allow write if pointer is currently set
+	 */
+	if (task->sched_preempt_delay.delay_req &&
+			((void *)preempt_delay_addr != NULL))
+		return -EINVAL;
+
+	/*
+	 * Validate the pointer. It should be aligned to 4-byte boundary.
+	 */
+	if (unlikely(!IS_ALIGNED(preempt_delay_addr, 4)))
+		return -EFAULT;
+	if (unlikely(!access_ok(VERIFY_WRITE, preempt_delay_addr, sizeof(u32))))
+		return -EFAULT;
+
+	task->sched_preempt_delay.delay_req = (u32 __user *) preempt_delay_addr;
+
+	/* zero out flags */
+	put_user(0, (u32 __user *)preempt_delay_addr);
+
+	return 0;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2203,6 +2233,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			me->mm->def_flags &= ~VM_NOHUGEPAGE;
 		up_write(&me->mm->mmap_sem);
 		break;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	case PR_SET_PREEMPT_DELAY:
+		if (sysctl_preempt_delay_available)
+			error = preempt_delay_write(current, arg2);
+		else
+			error = -EPERM;
+		break;
+	case PR_GET_PREEMPT_DELAY:
+		error = put_user(
+			(unsigned long)current->sched_preempt_delay.delay_req,
+					(unsigned long __user *)arg2);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..ad450cf 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	{
+		.procname	= "preempt_delay_available",
+		.data		= &sysctl_preempt_delay_available,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 20:56 [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Khalid Aziz
@ 2014-11-24 22:43 ` Andi Kleen
  2014-11-24 23:20   ` Khalid Aziz
  2014-11-24 23:35 ` Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2014-11-24 22:43 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

> +1. Location of shared flag can be set using prctl() only once. To
> +   write a new memory address, the previous memory address must be
> +   cleared first by writing NULL. Each new memory address requires
> +   validation in the kernel and update of pointers. Changing this
> +   address too many times creates too much overhead.

Can you explain this more? Doesn't make any sense to me.
The validation is just access_ok() which is only a few instructions?

Also I would drop the config symbol. Linux normally doesn't
do CONFIG for things like that.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..7f0d843 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
>  			init_completion(&vfork);
>  			get_task_struct(p);
>  		}
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +		p->sched_preempt_delay.delay_req = NULL;
> +		p->sched_preempt_delay.delay_granted = 0;
> +		p->sched_preempt_delay.yield_penalty = 0;
> +#endif

FWIW this would lead to every new thread having to reexecute
this. No good way around it, but it may eventually make
thread spawns more expensive if it was widely used.

>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules
> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif

Doesn't that need to be quantified? After all they may yield
only near the end of their time slice.

> +	}
> +
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there

I don't think any of the calls below are futex code.

> +	case PR_GET_PREEMPT_DELAY:
> +		error = put_user(
> +			(unsigned long)current->sched_preempt_delay.delay_req,
> +					(unsigned long __user *)arg2);
> +		break;
> +#endif

Unnecessary cast.

> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  #endif
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	{
> +		.procname	= "preempt_delay_available",
> +		.data		= &sysctl_preempt_delay_available,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0600,

Better 0644, so users can know if they can use it.

Rest looks reasonable to me.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 22:43 ` Andi Kleen
@ 2014-11-24 23:20   ` Khalid Aziz
  0 siblings, 0 replies; 22+ messages in thread
From: Khalid Aziz @ 2014-11-24 23:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Mon, 2014-11-24 at 14:43 -0800, Andi Kleen wrote:
> > +1. Location of shared flag can be set using prctl() only once. To
> > +   write a new memory address, the previous memory address must be
> > +   cleared first by writing NULL. Each new memory address requires
> > +   validation in the kernel and update of pointers. Changing this
> > +   address too many times creates too much overhead.
> 
> Can you explain this more? Doesn't make any sense to me.
> The validation is just access_ok() which is only a few instructions?

From the userspace app's point of view, each call to prctl() incurs
the overhead of a system call.

> 
> Also I would drop the config symbol. Linux normally doesn't
> do CONFIG for things like that.

CONFIG_SCHED_PREEMPT_DELAY allows one to keep this code out of compiled
kernel for custom kernels if this feature is definitely not needed.
Concerns were raised last time about this feature impacting other tasks.
Nevertheless, I am ok with removing the config option if that is the
consensus.

> 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9b7d746..7f0d843 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
> >  			init_completion(&vfork);
> >  			get_task_struct(p);
> >  		}
> > +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> > +		p->sched_preempt_delay.delay_req = NULL;
> > +		p->sched_preempt_delay.delay_granted = 0;
> > +		p->sched_preempt_delay.yield_penalty = 0;
> > +#endif
> 
> FWIW this would lead to every new thread having to reexecute
> this. No good way around it, but it may eventually make
> thread spawns more expensive if it was widely used.

Yes, that is true. Newly allocated task_struct is not zero'd out, so
this becomes necessary.

> 
> >  
> > +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> > +	/*
> > +	 * Clear the penalty flag for current task to reward it for
> > +	 * palying by the rules
> > +	 */
> > +	current->sched_preempt_delay.yield_penalty = 0;
> > +#endif
> 
> Doesn't that need to be quantified? After all they may yield
> only near the end of their time slice.

and that should be ok because the task was allowed to borrow a full
timeslice and it can use it up almost completely. Is there a reason to
differentiate between tasks yielding well before their borrowed
timeslice is up and tasks not yielding until almost the end of borrowed
timeslice?

> 
> > +	}
> > +
> > +	/*
> > +	 * Get the value of preemption delay request flag from userspace.
> > +	 * Task had already passed us the address where the flag is stored
> > +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> > +	 * futex, leverage the futex code here to read the flag. If there
> 
> I don't think any of the calls below are futex code.

The following code was borrowed from get_futex_value_locked(), but the
comment is not really necessary here and can be removed if it causes
confusion.

> 
> > +	case PR_GET_PREEMPT_DELAY:
> > +		error = put_user(
> > +			(unsigned long)current->sched_preempt_delay.delay_req,
> > +					(unsigned long __user *)arg2);
> > +		break;
> > +#endif
> 
> Unnecessary cast.

I get a bunch of warnings and errors if I remove either of the casts.

> 
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
> >  		.proc_handler	= proc_dointvec,
> >  	},
> >  #endif
> > +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> > +	{
> > +		.procname	= "preempt_delay_available",
> > +		.data		= &sysctl_preempt_delay_available,
> > +		.maxlen		= sizeof(int),
> > +		.mode		= 0600,
> 
> Better 0644, so users can know if they can use it.

OK, I can change that.

> 
> Rest looks reasonable to me.
> 
> -Andi

Thanks, Andi! I appreciate your feedback.

--
Khalid




* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 20:56 [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Khalid Aziz
  2014-11-24 22:43 ` Andi Kleen
@ 2014-11-24 23:35 ` Thomas Gleixner
  2014-11-25  2:12   ` Davidlohr Bueso
                     ` (2 more replies)
  2014-11-25  2:03 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Rik van Riel
  2014-11-25 10:12 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Srikar Dronamraju
  3 siblings, 3 replies; 22+ messages in thread
From: Thomas Gleixner @ 2014-11-24 23:35 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Mon, 24 Nov 2014, Khalid Aziz wrote:
> sched/fair: Add advisory flag for borrowing a timeslice
> 
> This patch adds a way for a task to request to borrow one timeslice
> from future if it is about to be preempted, so it could delay
> preemption and complete any critical task it is in the middle of.
> 
> This feature helps with performance on databases and has been
> used for many years on other OSs by the databases. This feature
> helps in situation where a task acquires a lock before performing a
> critical operation on the database and happens to get preempted before
> it completes its task.  This lock being held causes all other tasks
> that also acquire the same lock to perform their critical operation
> on the database, to start queueing up and causing large number of
> context switches.  This queueing problem can be avoided if the task
> that acquires lock first could request scheduler to let it borrow one
> timeslice once it enters its critical section and hence allow it to
> complete its critical section without causing queueing problem. If

While you are niftily avoiding talking about the nature of the lock, I
can take it for granted that you are talking about user space
spinlocks, right?

Simply if you would talk about futexes and pthread_mutexes then it
would have occured to you while implementing that feature, that the
kernel already has a mechanism to record a reference to a user space
data structure (robust_list_head) which is updated when a futex is
acquired in user space, i.e. when a critical section is entered. It's
not the same as you need, but it would be relatively simple to convey
that information there.

So what are the actual lock types and use cases and why can't you
combine that with the existing robust list mechanism?

> critical section completes before the task is due for preemption,
> the task can simply desassert its request. A task sends the

And that deassertion has which consequences before the next preempt
check happens?

> +config SCHED_PREEMPT_DELAY
> +	def_bool n
> +	prompt "Scheduler preemption delay support"
> +	depends on PROC_FS

Why so?

> @@ -1324,6 +1325,13 @@ struct task_struct {
>  	/* Revert to default priority/policy when forking */
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	struct preempt_delay {
> +		u32 __user *delay_req;		/* delay request flag pointer */
> +		unsigned char delay_granted;	/* currently in delay */
> +		unsigned char yield_penalty;	/* failure to yield penalty */
> +	} sched_preempt_delay;

No. First of all this wants to be a proper struct declaration outside
of task_struct.

Aside of that your user space side is actually a structure and not an
opaque u32 pointer, so this should be an explicit data type and not
something randomly defined in the guts of task_struct.

> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
> +extern void sched_preempt_delay_show(struct seq_file *m,
> +					struct task_struct *task);
> +extern void sched_preempt_delay_set(struct task_struct *task,
> +					unsigned char *val);
> +#endif

Can you please get rid of the leftovers of your previous patches
yourself and before posting? It's annoying as hell to review patches
which contain stale code.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..7f0d843 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
>  			init_completion(&vfork);
>  			get_task_struct(p);
>  		}
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +		p->sched_preempt_delay.delay_req = NULL;
> +		p->sched_preempt_delay.delay_granted = 0;
> +		p->sched_preempt_delay.yield_penalty = 0;
> +#endif

Sigh. We do not sprinkle that kind of #ifdef crap all over the
place. That's what inline functions in header files are for.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..38cb515 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4230,6 +4230,14 @@ SYSCALL_DEFINE0(sched_yield)
>  {
>  	struct rq *rq = this_rq_lock();
>  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +	/*
> +	 * Clear the penalty flag for current task to reward it for
> +	 * palying by the rules

Looking at that mess makes me palying^Wpale.

> +	 */
> +	current->sched_preempt_delay.yield_penalty = 0;
> +#endif
> +
>  	schedstat_inc(rq, yld_count);
>  	current->sched_class->yield_task(rq);
  
> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
> +/*
> + * delay_resched_rq(): Check if the task about to be preempted has
> + *	requested an additional time slice. If it has, grant it additional
> + *	timeslice once.
> + */
> +static void
> +delay_resched_rq(struct rq *rq)
> +{
> +	struct task_struct *curr = rq->curr;
> +	struct sched_entity *se;
> +	int cpu = task_cpu(curr);
> +	u32 __user *delay_req;
> +	unsigned int delay_req_flag;
> +	unsigned char *delay_flag;
> +
> +	/*
> +	 * Check if task is using pre-emption delay feature. If address
> +	 * for preemption delay request flag is not set, this task is
> +	 * not using preemption delay feature, we can reschedule without
> +	 * any delay

So what happens if:

   kernel.preempt_delay_available = 1;
   
   prctl(PR_SET_PREEMPT_DELAY, ...);

   kernel.preempt_delay_available = 0;

Nothing happens at all because you fail to give the sysop control over
the feature once you unleashed it.

The proper solution for this is to use a static key to control the
feature itself. That also reduces the overhead for those who are not
interested in that.

> +	 */
> +	delay_req = curr->sched_preempt_delay.delay_req;
> +
> +	if ((delay_req == NULL) || (cpu != smp_processor_id()))

check_preempt_tick() clearly does not care about that, but you inflict
a smp_processor_id() on every caller. I can see that you really care
about performance.

> +		goto resched_now;
> +
> +	/*
> +	 * Pre-emption delay will  be granted only once. If this task
> +	 * has already been granted delay, rechedule now
> +	 */
> +	if (curr->sched_preempt_delay.delay_granted) {
> +		curr->sched_preempt_delay.delay_granted = 0;
> +		goto resched_now;
> +	}
> +	/*
> +	 * Get the value of preemption delay request flag from userspace.
> +	 * Task had already passed us the address where the flag is stored
> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
> +	 * futex, leverage the futex code here to read the flag. If there
> +	 * is a page fault accessing this flag in userspace, that means
> +	 * userspace has not touched this flag recently and we can
> +	 * assume no preemption delay is needed.
> +	 *
> +	 * If task is not requesting additional timeslice, resched now
> +	 */
> +	if (delay_req) {

Surely we need to recheck delay_req here.

> +		int ret;
> +
> +		pagefault_disable();
> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
> +				sizeof(u32));
> +		pagefault_enable();
> +		delay_flag = &delay_req_flag;
> +		if (ret || !delay_flag[0])

This is really a well designed kernel/user space interface. NOT.

> +			goto resched_now;
> +	} else {
> +		goto resched_now;
> +	}
> +
> +	/*
> +	 * Current thread has requested preemption delay and has not
> +	 * been granted an extension yet. If this thread failed to yield
> +	 * processor after being granted amnesty last time, penalize it
> +	 * by not granting this delay request, otherwise give it an extra
> +	 * timeslice.
> +	 */
> +	if (curr->sched_preempt_delay.yield_penalty) {
> +		curr->sched_preempt_delay.yield_penalty = 0;
> +		goto resched_now;
> +	}
> +
> +	se = &curr->se;
> +	curr->sched_preempt_delay.delay_granted = 1;
> +	/*
> +	 * Set the penalty flag for failing to yield the processor after
> +	 * being granted immunity. This flag will be cleared in
> +	 * sched_yield() if the thread indeed calls sched_yield
> +	 */
> +	curr->sched_preempt_delay.yield_penalty = 1;

Why on earth do we need two flags here? Just because we can create
more code in the guts of the scheduler hot paths that way?

And surely we want to put them into two adjacent u8 to make life
easier for all architectures.

> +	/*
> +	 * Let the thread know it got amnesty and it should call
> +	 * sched_yield() when it is done to avoid penalty next time
> +	 * it wants amnesty. We need to write to userspace location.
> +	 * Since we just read from this location, chances are extremley
> +	 * low we might page fault. If we do page fault, we will ignore
> +	 * it and accept the cost of failed write in form of unnecessary
> +	 * penalty for userspace task for not yielding processor.

This is the completely wrong argument. We know that the task was
asking for an extra time slice because the copy from user above
succeeded. So we are better off to let the task actually handle its
pagefault than scheduling it out.

> +	 * This is a highly unlikely scenario.
> +	 */
> +	delay_flag[0] = 0;
> +	delay_flag[1] = 1;

Sigh.

> +#ifdef CONFIG_SCHED_PREEMPT_DELAY

And all of this needs to be in kernel/sys.c just because...

> +int sysctl_preempt_delay_available;
> +
> +static int
> +preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
> +{
> +	/*
> +	 * Do not allow write if pointer is currently set
> +	 */
> +	if (task->sched_preempt_delay.delay_req &&
> +			((void *)preempt_delay_addr != NULL))
> +		return -EINVAL;
> +	/*
> +	 * Validate the pointer. It should be aligned to 4-byte boundary.

So 4 bytes is a perfect boundary for everyone, right? Pulled that
number out of thin air or what?

> +	 */
> +	if (unlikely(!IS_ALIGNED(preempt_delay_addr, 4)))
> +		return -EFAULT;
> +	if (unlikely(!access_ok(rw, preempt_delay_addr, sizeof(u32))))
> +		return -EFAULT;
> +
> +	task->sched_preempt_delay.delay_req = (u32 __user *) preempt_delay_addr;
> +
> +	/* zero out flags */

Brilliant comment. I can see what the code is doing. What's way more
interesting and of course undocumented is why you are ignoring the
return value of put_user() ..

> +	put_user(0, (uint32_t *)preempt_delay_addr);

Aside of the general issues I have with this (see the inline replies
to your changelog) the overall impression of this patch is that it is
a half-baked and carelessly cobbled-together extract of some
database-specific kernel hackery, which I prefer not to see at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 20:56 [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Khalid Aziz
  2014-11-24 22:43 ` Andi Kleen
  2014-11-24 23:35 ` Thomas Gleixner
@ 2014-11-25  2:03 ` Rik van Riel
  2014-11-25  6:30   ` Davidlohr Bueso
  2014-11-25 14:52   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
  2014-11-25 10:12 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Srikar Dronamraju
  3 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2014-11-25  2:03 UTC (permalink / raw)
  To: Khalid Aziz, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: linux-kernel, linux-doc, linux-api

On 11/24/2014 03:56 PM, Khalid Aziz wrote:
> sched/fair: Add advisory flag for borrowing a timeslice
> 
> This patch adds a way for a task to request to borrow one
> timeslice from future if it is about to be preempted, so it could
> delay preemption and complete any critical task it is in the middle
> of.
> 
> This feature helps with performance on databases and has been used
> for many years on other OSs by the databases. This feature helps in
> situation where a task acquires a lock before performing a critical
> operation on the database and happens to get preempted

Why don't the other tasks that want the lock sleep on the
lock?

I can see this "solution" help mostly with userspace spinlocks,
which are relics of a past era that need to die. There is no
way userspace spinlocks will not fail miserably on virtual
machines, and it is time to get rid of them.

-- 
All rights reversed


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 23:35 ` Thomas Gleixner
@ 2014-11-25  2:12   ` Davidlohr Bueso
  2014-11-25  4:20   ` Mike Galbraith
  2014-11-25 14:45   ` Khalid Aziz
  2 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2014-11-25  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Khalid Aziz, corbet, mingo, hpa, peterz, riel, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck,
	linux-kernel, linux-doc, linux-api

On Tue, 2014-11-25 at 00:35 +0100, Thomas Gleixner wrote:
> On Mon, 24 Nov 2014, Khalid Aziz wrote:
> > sched/fair: Add advisory flag for borrowing a timeslice
> > 
> > This patch adds a way for a task to request to borrow one timeslice
> > from future if it is about to be preempted, so it could delay
> > preemption and complete any critical task it is in the middle of.
> > 
> > This feature helps with performance on databases and has been
> > used for many years on other OSs by the databases. This feature
> > helps in situation where a task acquires a lock before performing a
> > critical operation on the database and happens to get preempted before
> > it completes its task.  This lock being held causes all other tasks
> > that also acquire the same lock to perform their critical operation
> > on the database, to start queueing up and causing large number of
> > context switches.  This queueing problem can be avoided if the task
> > that acquires lock first could request scheduler to let it borrow one
> > timeslice once it enters its critical section and hence allow it to
> > complete its critical section without causing queueing problem. If
> 
> While you are niftily avoiding to talk about the nature of the lock, I
> can take it for granted that you are talking about user space
> spinlocks, right?

Probably, or perhaps userspace fair locks. If the task that is next in
line to acquire the lock is preempted and the lock is released, fairness
prevents anyone else from taking it instead.

Thanks,
Davidlohr




* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 23:35 ` Thomas Gleixner
  2014-11-25  2:12   ` Davidlohr Bueso
@ 2014-11-25  4:20   ` Mike Galbraith
  2014-11-25 14:50     ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
  2014-11-25 14:45   ` Khalid Aziz
  2 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2014-11-25  4:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Khalid Aziz, corbet, mingo, hpa, peterz, riel, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck,
	linux-kernel, linux-doc, linux-api

On Tue, 2014-11-25 at 00:35 +0100, Thomas Gleixner wrote:

> Aside of the general issues I have with this (see the inline replies
> to your changelog) the overall impression of this patch is that it is
> a half-baked and carelessly cobbled-together extract of some
> database-specific kernel hackery, which I prefer not to see at all.

It culminates in a lumbering pseudo RT class of task disguised as a fair
class task.  I'd expect more gain by twiddling knobs to let last buddy
do its job than the 3% mentioned.

You could perhaps create a SUPER_BATCH class that is not wakeup
preempted by any fair class task of <= priority, not only BATCH and
IDLE, but that's as nasty as this patch, though loads prettier.  The
tick time thing doesn't feel right at all... if you're hurt badly by the
tick, you're likely holding the lock too long methinks.

	-Mike




* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-25  2:03 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Rik van Riel
@ 2014-11-25  6:30   ` Davidlohr Bueso
  2014-11-25 13:38     ` Rik van Riel
  2014-11-25 14:52   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
  1 sibling, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2014-11-25  6:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Khalid Aziz, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck,
	linux-kernel, linux-doc, linux-api

On Mon, 2014-11-24 at 21:03 -0500, Rik van Riel wrote:
> I can see this "solution" help mostly with userspace spinlocks,
> which are relics of a past era that need to die. There is no
> way userspace spinlocks will not fail miserably on virtual
> machines, and it is time to get rid of them.

No, not really. Spinlocks are still very useful on bare metal.
Virtualization is not the only thing out there.

Thanks,
Davidlohr



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-24 20:56 [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Khalid Aziz
                   ` (2 preceding siblings ...)
  2014-11-25  2:03 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Rik van Riel
@ 2014-11-25 10:12 ` Srikar Dronamraju
  2014-11-25 14:56   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
  3 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2014-11-25 10:12 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck,
	linux-kernel, linux-doc, linux-api

> 
> - Request to borrow timeslice is not guaranteed to be honored.
> - If the task is allowed to borrow, kernel will inform the task
>   of this. When this happens, task must yield the processor as soon
>   as it completes its critical section.
> - If the task fails to yield processor after being allowed to
>   borrow, it is penalized by forcing it to skip its next time slot
>   by the scheduler.
> - Task is charged additional time for the borrowed timeslice as
>   accumulated run time. This pushes it further down in consideration
>   for the next task to run.
> 

Is there a way for us to identify if the lock is contended?
Because it may not be prudent to allow a task to borrow a timeslice
for a lock which isn't contended.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace)
  2014-11-25  6:30   ` Davidlohr Bueso
@ 2014-11-25 13:38     ` Rik van Riel
  0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2014-11-25 13:38 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Khalid Aziz, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck,
	linux-kernel, linux-doc, linux-api

On 11/25/2014 01:30 AM, Davidlohr Bueso wrote:
> On Mon, 2014-11-24 at 21:03 -0500, Rik van Riel wrote:
>> I can see this "solution" help mostly with userspace spinlocks, 
>> which are relics of a past era that need to die. There is no way
>> userspace spinlocks will not fail miserably on virtual machines,
>> and it is time to get rid of them.
> 
> No, not really. Spinlocks are still very useful on bare metal. 
> Virtualization is not the only thing out there.

How many people are going to build two different binaries,
one for bare metal, and one for virtualized environments?

I suspect the vast majority of applications will only be
built once, and it would be nice if it wasn't using a
locking scheme that broke horribly on a significant part
of the deployments...

-- 
All rights reversed


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-24 23:35 ` Thomas Gleixner
  2014-11-25  2:12   ` Davidlohr Bueso
  2014-11-25  4:20   ` Mike Galbraith
@ 2014-11-25 14:45   ` Khalid Aziz
  2014-11-25 18:27     ` Davidlohr Bueso
  2 siblings, 1 reply; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 14:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

Thanks for the review. I appreciate your comments. Please see my 
response inline.

On 11/24/2014 04:35 PM, Thomas Gleixner wrote:
> On Mon, 24 Nov 2014, Khalid Aziz wrote:
>> sched/fair: Add advisory flag for borrowing a timeslice
>>
>> This patch adds a way for a task to request to borrow one timeslice
>> from future if it is about to be preempted, so it could delay
>> preemption and complete any critical task it is in the middle of.
>>
>> This feature helps with performance on databases and has been
>> used for many years on other OSs by the databases. This feature
>> helps in situation where a task acquires a lock before performing a
>> critical operation on the database and happens to get preempted before
>> it completes its task.  This lock being held causes all other tasks
>> that also acquire the same lock to perform their critical operation
>> on the database, to start queueing up and causing large number of
>> context switches.  This queueing problem can be avoided if the task
>> that acquires lock first could request scheduler to let it borrow one
>> timeslice once it enters its critical section and hence allow it to
>> complete its critical section without causing queueing problem. If
>
> While you are niftily avoiding to talk about the nature of the lock, I
> can take it for granted that you are talking about user space
> spinlocks, right?

Sorry, it was certainly not my intention to avoid talking about the 
nature of the locks. I could have done a better job of explaining the 
locks in use. Yes, these are userspace spinlocks implemented in a
database library used by the database.

>
> Simply if you would talk about futexes and pthread_mutexes then it
> would have occurred to you while implementing that feature, that the
> kernel already has a mechanism to record a reference to a user space
> data structure (robust_list_head) which is updated when a futex is
> acquired in user space, i.e. when a critical section is entered. It's
> not the same as you need, but it would be relatively simple to convey
> that information there.
>
> So what are the actual lock types and use cases and why can't you
> combine that with the existing robust list mechanism?

When I was asked to solve this queueing problem, I started working on a 
solution built around futexes and then found out a futex-based solution 
is a no-go. The two primary users of this feature are databases and 
Java. Neither uses POSIX locks or futexes, I am told by the folks who 
maintain these two. Hence this solution, which allows them to ask for 
amnesty from preemption for just one more timeslice by letting them 
borrow a timeslice from the future IF they are going to be preempted 
BEFORE their critical section is done. The userspace app is expected to 
yield the processor as soon as it is done with its critical section, 
which means the app may never use the extra timeslice it had asked for.

This solution has been used by both databases and Java on other OSs and 
has shown performance improvement. Andrew had asked for performance 
numbers on Linux with this patch last time I sent this out, and it took 
me a while to get the performance folks to run a full TPC-C workload. 
They did see a 3% improvement in TPC-C, as I noted in the commit log, 
and that is a significant improvement.

>
>> critical section completes before the task is due for preemption,
>> the task can simply deassert its request. A task sends the
>
> And that deassertion has which consequences before the next preempt
> check happens?
>
>> +config SCHED_PREEMPT_DELAY
>> +	def_bool n
>> +	prompt "Scheduler preemption delay support"
>> +	depends on PROC_FS
>
> Why so?

I assume you are asking about "depends on PROC_FS"? This patch uses proc 
to publish statistics, but it really is publishing through an existing 
proc file, so this is a weak dependency at best. I will remove this.

>
>> @@ -1324,6 +1325,13 @@ struct task_struct {
>>   	/* Revert to default priority/policy when forking */
>>   	unsigned sched_reset_on_fork:1;
>>   	unsigned sched_contributes_to_load:1;
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> +	struct preempt_delay {
>> +		u32 __user *delay_req;		/* delay request flag pointer */
>> +		unsigned char delay_granted;	/* currently in delay */
>> +		unsigned char yield_penalty;	/* failure to yield penalty */
>> +	} sched_preempt_delay;
>
> No. First of all this wants to be a proper struct declaration outside
> of task_struct.
>
> Aside of that your user space side is actually a structure and not an
> opaque u32 pointer, so this should be an explicit data type and not
> something randomly defined in the guts of task_struct.

That is a good point. I will fix it.

>
>> +#if defined(CONFIG_SCHED_PREEMPT_DELAY) && defined(CONFIG_PROC_FS)
>> +extern void sched_preempt_delay_show(struct seq_file *m,
>> +					struct task_struct *task);
>> +extern void sched_preempt_delay_set(struct task_struct *task,
>> +					unsigned char *val);
>> +#endif
>
> Can you please get rid of the leftovers of your previous patches
> yourself and before posting? It's annoying as hell to review patches
> which contain stale code.

Sorry, my screw up. I missed that one when reviewing my own patch.

>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 9b7d746..7f0d843 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1671,6 +1671,11 @@ long do_fork(unsigned long clone_flags,
>>   			init_completion(&vfork);
>>   			get_task_struct(p);
>>   		}
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> +		p->sched_preempt_delay.delay_req = NULL;
>> +		p->sched_preempt_delay.delay_granted = 0;
>> +		p->sched_preempt_delay.yield_penalty = 0;
>> +#endif
>
> Sigh. We do not sprinkle that kind of #ifdef crap all over the
> place. That's what inline functions in header files are for.

Sure. I will change that.

>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 240157c..38cb515 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4230,6 +4230,14 @@ SYSCALL_DEFINE0(sched_yield)
>>   {
>>   	struct rq *rq = this_rq_lock();
>>
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> +	/*
>> +	 * Clear the penalty flag for current task to reward it for
>> +	 * palying by the rules
>
> Looking at that mess makes me palying^Wpale.

:) will clean it up.

>
>> +	 */
>> +	current->sched_preempt_delay.yield_penalty = 0;
>> +#endif
>> +
>>   	schedstat_inc(rq, yld_count);
>>   	current->sched_class->yield_task(rq);
>
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>> +/*
>> + * delay_resched_rq(): Check if the task about to be preempted has
>> + *	requested an additional time slice. If it has, grant it additional
>> + *	timeslice once.
>> + */
>> +static void
>> +delay_resched_rq(struct rq *rq)
>> +{
>> +	struct task_struct *curr = rq->curr;
>> +	struct sched_entity *se;
>> +	int cpu = task_cpu(curr);
>> +	u32 __user *delay_req;
>> +	unsigned int delay_req_flag;
>> +	unsigned char *delay_flag;
>> +
>> +	/*
>> +	 * Check if task is using pre-emption delay feature. If address
>> +	 * for preemption delay request flag is not set, this task is
>> +	 * not using preemption delay feature, we can reschedule without
>> +	 * any delay
>
> So what happens if:
>
>     kernel.preempt_delay_available = 1;
>
>     prctl(PR_SET_PREEMPT_DELAY, ...);
>
>     kernel.preempt_delay_available = 0;
>
> Nothing happens at all because you fail to give the sysop control over
> the feature once you unleashed it.
>
> The proper solution for this is to use a static key to control the
> feature itself. That also reduces the overhead for those who are not
> interested in that.

I was trying to avoid adding one more check to the scheduler path, and 
that opened up a hole, as you pointed out. Good catch! I will add the 
conditional to the scheduler path to plug this hole.

>
>> +	 */
>> +	delay_req = curr->sched_preempt_delay.delay_req;
>> +
>> +	if ((delay_req == NULL) || (cpu != smp_processor_id()))
>
> check_preempt_tick() clearly does not care about that, but you inflict
> a smp_processor_id() on every caller. I can see that you really care
> about performance.

:) I do care about performance. I will remove that check.

>
>> +		goto resched_now;
>> +
>> +	/*
>> +	 * Pre-emption delay will  be granted only once. If this task
>> +	 * has already been granted delay, rechedule now
>> +	 */
>> +	if (curr->sched_preempt_delay.delay_granted) {
>> +		curr->sched_preempt_delay.delay_granted = 0;
>> +		goto resched_now;
>> +	}
>> +	/*
>> +	 * Get the value of preemption delay request flag from userspace.
>> +	 * Task had already passed us the address where the flag is stored
>> +	 * in userspace earlier. This flag is just like the PROCESS_PRIVATE
>> +	 * futex, leverage the futex code here to read the flag. If there
>> +	 * is a page fault accessing this flag in userspace, that means
>> +	 * userspace has not touched this flag recently and we can
>> +	 * assume no preemption delay is needed.
>> +	 *
>> +	 * If task is not requesting additional timeslice, resched now
>> +	 */
>> +	if (delay_req) {
>
> Surely we need to recheck delay_req here.
>

Agreed, this is superfluous. I will clean it up.

>> +		int ret;
>> +
>> +		pagefault_disable();
>> +		ret = __copy_from_user_inatomic(&delay_req_flag, delay_req,
>> +				sizeof(u32));
>> +		pagefault_enable();
>> +		delay_flag = &delay_req_flag;
>> +		if (ret || !delay_flag[0])
>
> This is really a well designed kernel/user space interface. NOT.

Please do suggest a better interface. My constraint is that it has to 
be very low overhead. A system call is just too much overhead and will 
negate any benefit from this.

>
>> +			goto resched_now;
>> +	} else {
>> +		goto resched_now;
>> +	}
>> +
>> +	/*
>> +	 * Current thread has requested preemption delay and has not
>> +	 * been granted an extension yet. If this thread failed to yield
>> +	 * processor after being granted amnesty last time, penalize it
>> +	 * by not granting this delay request, otherwise give it an extra
>> +	 * timeslice.
>> +	 */
>> +	if (curr->sched_preempt_delay.yield_penalty) {
>> +		curr->sched_preempt_delay.yield_penalty = 0;
>> +		goto resched_now;
>> +	}
>> +
>> +	se = &curr->se;
>> +	curr->sched_preempt_delay.delay_granted = 1;
>> +	/*
>> +	 * Set the penalty flag for failing to yield the processor after
>> +	 * being granted immunity. This flag will be cleared in
>> +	 * sched_yield() if the thread indeed calls sched_yield
>> +	 */
>> +	curr->sched_preempt_delay.yield_penalty = 1;
>
> Why on earth do we need two flags here? Just because we can create
> more code in the guts of the scheduler hot paths that way?
>
> And surely we want to put them into two adjacent u8 to make life
> easier for all architectures.
>

They keep track of two different things. delay_granted keeps track of 
whether a request for preemption delay was granted and is used to stop 
the task from being granted a second timeslice. yield_penalty keeps 
track of whether the task should be penalized or not. These two flags 
are cleared at two different times. delay_granted is cleared when the 
task is preempted forcibly after being granted a delay whereas 
yield_penalty is cleared when the task gives up the processor voluntarily.

>> +	/*
>> +	 * Let the thread know it got amnesty and it should call
>> +	 * sched_yield() when it is done to avoid penalty next time
>> +	 * it wants amnesty. We need to write to userspace location.
>> +	 * Since we just read from this location, chances are extremely
>> +	 * low we might page fault. If we do page fault, we will ignore
>> +	 * it and accept the cost of failed write in form of unnecessary
>> +	 * penalty for userspace task for not yielding processor.
>
> This is the completely wrong argument. We know that the task was
> asking for an extra time slice because the copy from user above
> succeeded. So we are better off letting the task actually handle its
> pagefault than scheduling it out.
>
>> +	 * This is a highly unlikely scenario.
>> +	 */
>> +	delay_flag[0] = 0;
>> +	delay_flag[1] = 1;
>
> Sigh.
>
>> +#ifdef CONFIG_SCHED_PREEMPT_DELAY
>
> And all of this needs to be in kernel/sys.c just because...

I will look for a better way to do it.

>
>> +int sysctl_preempt_delay_available;
>> +
>> +static int
>> +preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
>> +{
>> +	/*
>> +	 * Do not allow write if pointer is currently set
>> +	 */
>> +	if (task->sched_preempt_delay.delay_req &&
>> +			((void *)preempt_delay_addr != NULL))
>> +		return -EINVAL;
>> +	/*
>> +	 * Validate the pointer. It should be aligned to 4-byte boundary.
>
> So 4 bytes is a perfect boundary for everyone, right? Pulled that
> number out of thin air or what?

OK. What would you suggest? I can be taught to do this better.

>
>> +	 */
>> +	if (unlikely(!IS_ALIGNED(preempt_delay_addr, 4)))
>> +		return -EFAULT;
>> +	if (unlikely(!access_ok(rw, preempt_delay_addr, sizeof(u32))))
>> +		return -EFAULT;
>> +
>> +	task->sched_preempt_delay.delay_req = (u32 __user *) preempt_delay_addr;
>> +
>> +	/* zero out flags */
>
> Brilliant comment. I can see what the code is doing. What's way more
> interesting and of course undocumented is why you are ignoring the
> return value of put_user() ..
>
>> +	put_user(0, (uint32_t *)preempt_delay_addr);
>
> Aside of the general issues I have with this (see the inline replies
> to your changelog) the overall impression of this patch is that it is
> a half-baked and carelessly cobbled together extract of some
> database-specific kernel hackery, which I prefer not to see at all.
>
> Thanks,
>
> 	tglx
>

Even if it helps one of the major workloads for Linux to perform better?

Thanks! You have a lot of good feedback for me and I really appreciate it.

--
Khalid


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25  4:20   ` Mike Galbraith
@ 2014-11-25 14:50     ` Khalid Aziz
  2014-11-25 17:46       ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 14:50 UTC (permalink / raw)
  To: Mike Galbraith, Thomas Gleixner
  Cc: corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak, mgorman,
	liwanp, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 11/24/2014 09:20 PM, Mike Galbraith wrote:
> On Tue, 2014-11-25 at 00:35 +0100, Thomas Gleixner wrote:
>
>> Aside of the general issues I have with this (see the inline replies
>> to your changelog) the overall impression of this patch is that it is
>> a half-baked and carelessly cobbled together extract of some
>> database-specific kernel hackery, which I prefer not to see at all.
>
> It culminates in a lumbering pseudo RT class of task disguised as a fair
> class task.  I'd expect more gain by twiddling knobs to let last buddy
> do its job than the 3% mentioned.
>
> You could perhaps create a SUPER_BATCH class that is not wakeup
> preempted by any fair class task of <= priority, not only BATCH and
> IDLE, but that's as nasty as this patch, though loads prettier.  The
> tick time thing doesn't feel right at all... if you're hurt badly by the
> tick, you're likely holding the lock too long methinks.
>
> 	-Mike
>
>

It is definitely not an attempt to solve any kind of RT problem, and it 
would be a poor one if it were. RT is all about guarantees. This patch 
does not help there at all, and hence I have no intention of ever 
applying anything like this to SCHED_FIFO or SCHED_RR.

This problem is not caused by the task holding the lock too long. It is 
caused by the task happening to acquire the lock just before its current 
timeslice is up. In that case, it does not matter how long the task 
holds the lock.

--
Khalid


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25  2:03 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Rik van Riel
  2014-11-25  6:30   ` Davidlohr Bueso
@ 2014-11-25 14:52   ` Khalid Aziz
  2014-11-25 15:25     ` Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 14:52 UTC (permalink / raw)
  To: Rik van Riel, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: linux-kernel, linux-doc, linux-api

On 11/24/2014 07:03 PM, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/24/2014 03:56 PM, Khalid Aziz wrote:
>> sched/fair: Add advisory flag for borrowing a timeslice
>>
>> This patch adds a way for a task to request to borrow one
>> timeslice from future if it is about to be preempted, so it could
>> delay preemption and complete any critical task it is in the middle
>> of.
>>
>> This feature helps with performance on databases and has been used
>> for many years on other OSs by the databases. This feature helps in
>> situation where a task acquires a lock before performing a critical
>> operation on the database and happens to get preempted
>
> Why don't the other tasks that want the lock sleep on the
> lock?
>
> I can see this "solution" help mostly with userspace spinlocks,
> which are relics of a past era that need to die. There is no
> way userspace spinlocks will not fail miserably on virtual
> machines, and it is time to get rid of them.

This solution indeed is for userspace spinlocks. Database code has been 
written with all critical locking implemented in userspace (or so I have 
been told by database folks; I am not a database guy).

--
Khalid



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 10:12 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Srikar Dronamraju
@ 2014-11-25 14:56   ` Khalid Aziz
  0 siblings, 0 replies; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 14:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 11/25/2014 03:12 AM, Srikar Dronamraju wrote:
>>
>> - Request to borrow a timeslice is not guaranteed to be honored.
>> - If the task is allowed to borrow, kernel will inform the task
>>    of this. When this happens, task must yield the processor as soon
>>    as it completes its critical section.
>> - If the task fails to yield the processor after being allowed to
>>    borrow, the scheduler penalizes it by forcing it to skip its
>>    next timeslice.
>> - Task is charged additional time for the borrowed timeslice as
>>    accumulated run time. This pushes it further down in consideration
>>    for the next task to run.
>>
>
> Is there a way for us to identify if the lock is contended?
> Because it may not be prudent to allow a task to borrow a timeslice
> for a lock which isn't contended.
>

Userspace knows that. It is hard to determine this from the kernel. 
Darren Hart had worked on a solution to a similar issue and I spent a 
fair amount of time looking through that code. Darren's solution comes 
into play after contention has already happened and does reduce the cost 
of contention. Database folks think the cost is already too high once 
contention has happened, because of the resulting context switches, and 
post-contention solutions do not help. This solution helps reduce 
contention on locks, and the userspace code designer is in the best 
position to determine which locks are subject to such contention.

--
Khalid


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 14:52   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
@ 2014-11-25 15:25     ` Rik van Riel
  2014-11-25 17:22       ` Khalid Aziz
  0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2014-11-25 15:25 UTC (permalink / raw)
  To: Khalid Aziz, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, liwanp, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: linux-kernel, linux-doc, linux-api

On 11/25/2014 09:52 AM, Khalid Aziz wrote:
> On 11/24/2014 07:03 PM, Rik van Riel wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 11/24/2014 03:56 PM, Khalid Aziz wrote:
>>> sched/fair: Add advisory flag for borrowing a timeslice
>>>
>>> This patch adds a way for a task to request to borrow one
>>> timeslice from future if it is about to be preempted, so it could
>>> delay preemption and complete any critical task it is in the middle
>>> of.
>>>
>>> This feature helps with performance on databases and has been used
>>> for many years on other OSs by the databases. This feature helps in
>>> situation where a task acquires a lock before performing a critical
>>> operation on the database and happens to get preempted
>>
>> Why don't the other tasks that want the lock sleep on the
>> lock?
>>
>> I can see this "solution" help mostly with userspace spinlocks,
>> which are relics of a past era that need to die. There is no
>> way userspace spinlocks will not fail miserably on virtual
>> machines, and it is time to get rid of them.
> 
> This solution indeed is for userspace spinlocks. Database code has been
> written with all critical locking implemented in userspace (as I have
> been told by database folks. I am not a database guy).

They should fix that.

The scheme you propose can really only work on bare metal,
and not on virtual machines. That solves the problem for,
what, 60% of new installs (and dropping)?


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 15:25     ` Rik van Riel
@ 2014-11-25 17:22       ` Khalid Aziz
  2014-11-25 17:45         ` Rik van Riel
  0 siblings, 1 reply; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 17:22 UTC (permalink / raw)
  To: Rik van Riel, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: linux-kernel, linux-doc, linux-api

On 11/25/2014 08:25 AM, Rik van Riel wrote:
> On 11/25/2014 09:52 AM, Khalid Aziz wrote:
>> On 11/24/2014 07:03 PM, Rik van Riel wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 11/24/2014 03:56 PM, Khalid Aziz wrote:
>>>> sched/fair: Add advisory flag for borrowing a timeslice
>>>>
>>>> This patch adds a way for a task to request to borrow one
>>>> timeslice from future if it is about to be preempted, so it could
>>>> delay preemption and complete any critical task it is in the middle
>>>> of.
>>>>
>>>> This feature helps with performance on databases and has been used
>>>> for many years on other OSs by the databases. This feature helps in
>>>> situation where a task acquires a lock before performing a critical
>>>> operation on the database and happens to get preempted
>>>
>>> Why don't the other tasks that want the lock sleep on the
>>> lock?
>>>
>>> I can see this "solution" help mostly with userspace spinlocks,
>>> which are relics of a past era that need to die. There is no
>>> way userspace spinlocks will not fail miserably on virtual
>>> machines, and it is time to get rid of them.
>>
>> This solution indeed is for userspace spinlocks. Database code has been
>> written with all critical locking implemented in userspace (as I have
>> been told by database folks. I am not a database guy).
>
> They should fix that.

Agreed, but that is a very large task for databases, and potentially for 
Java as well (think exhaustive testing on top of the code rewrite). I do 
not see this happening except as part of a major re-architecture. In the 
meantime, if we can give them help without significantly impacting the 
kernel for other users, it only makes Linux a more usable platform for 
them.

Thanks,
Khalid


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 17:22       ` Khalid Aziz
@ 2014-11-25 17:45         ` Rik van Riel
  0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2014-11-25 17:45 UTC (permalink / raw)
  To: Khalid Aziz, tglx, corbet, mingo, hpa, peterz, akpm, rientjes,
	ak, mgorman, raistlin, kirill.shutemov, atomlin, avagin,
	gorcunov, serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn,
	keescook, yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: linux-kernel, linux-doc, linux-api

On 11/25/2014 12:22 PM, Khalid Aziz wrote:
> On 11/25/2014 08:25 AM, Rik van Riel wrote:
>> On 11/25/2014 09:52 AM, Khalid Aziz wrote:
>>> On 11/24/2014 07:03 PM, Rik van Riel wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> On 11/24/2014 03:56 PM, Khalid Aziz wrote:
>>>>> sched/fair: Add advisory flag for borrowing a timeslice
>>>>>
>>>>> This patch adds a way for a task to request to borrow one
>>>>> timeslice from future if it is about to be preempted, so it could
>>>>> delay preemption and complete any critical task it is in the middle
>>>>> of.
>>>>>
>>>>> This feature helps with performance on databases and has been used
>>>>> for many years on other OSs by the databases. This feature helps in
>>>>> situation where a task acquires a lock before performing a critical
>>>>> operation on the database and happens to get preempted
>>>>
>>>> Why don't the other tasks that want the lock sleep on the
>>>> lock?
>>>>
>>>> I can see this "solution" help mostly with userspace spinlocks,
>>>> which are relics of a past era that need to die. There is no
>>>> way userspace spinlocks will not fail miserably on virtual
>>>> machines, and it is time to get rid of them.
>>>
>>> This solution indeed is for userspace spinlocks. Database code has been
>>> written with all critical locking implemented in userspace (as I have
>>> been told by database folks. I am not a database guy).
>>
>> They should fix that.
> 
> Agreed, but that is a very large task for databases for sure and
> potentially for java as well (think exhaustive testing as well besides
> code re-write). I do not see this happening except as part of major
> re-architecture. In the mean time, if we can give them help without
> impacting kernel significantly for other users, it only makes Linux a
> more usable platform for them.

I am not convinced that permanently adding overhead to the
Linux scheduler in order to deal with a temporary problem in
database and Java software (probably not every JVM, either!)
is a worthwhile trade-off.



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 14:50     ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
@ 2014-11-25 17:46       ` Mike Galbraith
  2014-11-25 19:38         ` Khalid Aziz
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2014-11-25 17:46 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Thomas Gleixner, corbet, mingo, hpa, peterz, riel, akpm,
	rientjes, ak, mgorman, liwanp, raistlin, kirill.shutemov,
	atomlin, avagin, gorcunov, serge.hallyn, athorlton, oleg,
	vdavydov, daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps,
	axboe, paulmck, linux-kernel, linux-doc, linux-api

On Tue, 2014-11-25 at 07:50 -0700, Khalid Aziz wrote:

> It is definitely not an attempt to solve any kind of RT problem.

No no, I'm saying that giving certain tasks special dispensations
effectively elevates them.  Temporarily or otherwise, they play by
different rules, will block more deserving tasks, and it's not cut and
dried that that blocking will not do more harm than good.

Is it a clear win to make say some kworker or other global asset wait
when it could have preempted and been gone in usecs?  Nope.

	-Mike



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 14:45   ` Khalid Aziz
@ 2014-11-25 18:27     ` Davidlohr Bueso
  2014-11-25 19:23       ` Khalid Aziz
  0 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2014-11-25 18:27 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Thomas Gleixner, corbet, mingo, hpa, peterz, riel, akpm,
	rientjes, ak, mgorman, liwanp, raistlin, kirill.shutemov,
	atomlin, avagin, gorcunov, serge.hallyn, athorlton, oleg,
	vdavydov, daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps,
	axboe, paulmck, linux-kernel, linux-doc, linux-api

On Tue, 2014-11-25 at 07:45 -0700, Khalid Aziz wrote:
> This solution has been used by both database and java on other OSs and 
> has shown performance improvement. Andrew had asked for performance 
> numbers on Linux with this patch last time I sent this out and it took 
> me a while to get performance folks to run a full TPC-C workload. They 
> did see a 3% improvement in tpcc as I noted in commit log and that is a 
> significant improvement.

3% for such a change seems pretty worthless... I would have expected
this to impact performance much more.

Thanks,
Davidlohr



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 18:27     ` Davidlohr Bueso
@ 2014-11-25 19:23       ` Khalid Aziz
  2014-11-25 19:40         ` Davidlohr Bueso
  0 siblings, 1 reply; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 19:23 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, corbet, mingo, hpa, peterz, riel, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

On 11/25/2014 11:27 AM, Davidlohr Bueso wrote:
> On Tue, 2014-11-25 at 07:45 -0700, Khalid Aziz wrote:
>> This solution has been used by both database and java on other OSs and
>> has shown performance improvement. Andrew had asked for performance
>> numbers on Linux with this patch last time I sent this out and it took
>> me a while to get performance folks to run a full TPC-C workload. They
>> did see a 3% improvement in tpcc as I noted in commit log and that is a
>> significant improvement.
>
> 3% for such a change seems pretty worthless... I would have expected
> this having to impact performance much more.
>

The performance impact will depend upon how big a bottleneck the 
spinlock was creating and how severe the resulting convoy problem was. 
Database guys try to squeeze every bit out of the system, and 3% is 
considered a very good gain. A 10% gain would have been nicer :)

--
Khalid


* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 17:46       ` Mike Galbraith
@ 2014-11-25 19:38         ` Khalid Aziz
  0 siblings, 0 replies; 22+ messages in thread
From: Khalid Aziz @ 2014-11-25 19:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, corbet, mingo, hpa, peterz, riel, akpm,
	rientjes, ak, mgorman, liwanp, raistlin, kirill.shutemov,
	atomlin, avagin, gorcunov, serge.hallyn, athorlton, oleg,
	vdavydov, daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps,
	axboe, paulmck, linux-kernel, linux-doc, linux-api

On 11/25/2014 10:46 AM, Mike Galbraith wrote:
> On Tue, 2014-11-25 at 07:50 -0700, Khalid Aziz wrote:
>
>> It is definitely not an attempt to solve any kind of RT problem.
>
> No no, I'm saying that giving certain tasks special dispensations
> effectively elevates them.  Temporarily or otherwise, they play by
> different rules, will block more deserving tasks, and it's not cut and
> dried that that blocking will not do more harm than good.
>
> Is it a clear win to make say some kworker or other global asset wait
> when it could have preempted and been gone in usecs?  Nope.

I understand. You are right, this allows some apps to gain special 
dispensation. On a general purpose system, I agree this can be 
problematic, and it is important that it be easy to disable. This is why 
I added a sysctl tunable and made "disabled" the default state for this 
feature.

Allowing temporary elevation of a task as part of the overall system 
design is OK when it is intentional and done after considering the 
impact on other tasks running on the system. A large database server 
typically is not a general purpose server that runs arbitrary tasks. 
These systems are tweaked in many ways to ensure optimal performance for 
the database, not necessarily for other apps. This patch gives admins 
one more knob to turn when maximizing performance. Any general purpose 
system that sees no use for this feature can leave it in its default 
disabled state.

I can also see this patch being useful for other servers dedicated to 
specific tasks, for instance in telecommunication infrastructure, where 
a server needs to update a critical database with minimal contention: 
the switch map on a telco switch controller, say, or the channel 
allocation map on a base station controller. I am sure people more 
familiar with other industries can see its usefulness in other dedicated 
applications.

Thanks,
Khalid



* Re: [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice
  2014-11-25 19:23       ` Khalid Aziz
@ 2014-11-25 19:40         ` Davidlohr Bueso
  0 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2014-11-25 19:40 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Thomas Gleixner, corbet, mingo, hpa, peterz, riel, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

On Tue, 2014-11-25 at 12:23 -0700, Khalid Aziz wrote:
> On 11/25/2014 11:27 AM, Davidlohr Bueso wrote:
> > On Tue, 2014-11-25 at 07:45 -0700, Khalid Aziz wrote:
> >> This solution has been used by both database and java on other OSs and
> >> has shown performance improvement. Andrew had asked for performance
> >> numbers on Linux with this patch last time I sent this out and it took
> >> me a while to get performance folks to run a full TPC-C workload. They
> >> did see a 3% improvement in tpcc as I noted in commit log and that is a
> >> significant improvement.
> >
> > 3% for such a change seems pretty worthless... I would have expected
> > this having to impact performance much more.
> >
> 
> Performance impact will depend upon how big a bottleneck the spinlock 
> was creating and how severe the resulting convoy problem was. Database 
> guys try to squeeze every bit out of the system and 3% is considered to 
> be very good gain. 10% gain would have been nicer :)

Right, but my point is that 3% indicates that this isn't really a
problem in the first place. It would be good to know the hw
characteristics as well. We've tackled Oracle-related performance issues
in the past on OLTP benchmarks, with much more noticeable
improvements -- which is why I'm not impressed with your numbers,
particularly for a patch of this nature. That said, I do realize that
Oracle is not the only workload that can potentially gain from userspace
spinlocks.

Thanks,
Davidlohr



end of thread, other threads:[~2014-11-25 19:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-24 20:56 [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Khalid Aziz
2014-11-24 22:43 ` Andi Kleen
2014-11-24 23:20   ` Khalid Aziz
2014-11-24 23:35 ` Thomas Gleixner
2014-11-25  2:12   ` Davidlohr Bueso
2014-11-25  4:20   ` Mike Galbraith
2014-11-25 14:50     ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
2014-11-25 17:46       ` Mike Galbraith
2014-11-25 19:38         ` Khalid Aziz
2014-11-25 14:45   ` Khalid Aziz
2014-11-25 18:27     ` Davidlohr Bueso
2014-11-25 19:23       ` Khalid Aziz
2014-11-25 19:40         ` Davidlohr Bueso
2014-11-25  2:03 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Rik van Riel
2014-11-25  6:30   ` Davidlohr Bueso
2014-11-25 13:38     ` Rik van Riel
2014-11-25 14:52   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
2014-11-25 15:25     ` Rik van Riel
2014-11-25 17:22       ` Khalid Aziz
2014-11-25 17:45         ` Rik van Riel
2014-11-25 10:12 ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice (was: Pre-emption control for userspace) Srikar Dronamraju
2014-11-25 14:56   ` [PATCH v3] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
