All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
@ 2014-12-18 18:44 Khalid Aziz
  2014-12-18 22:28 ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-18 18:44 UTC (permalink / raw)
  To: tglx, corbet, mingo, hpa, peterz, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck
  Cc: Khalid Aziz, linux-kernel, linux-doc, linux-api

sched/fair: Add advisory flag for borrowing a timeslice

This patch adds a way for a task to request to borrow one timeslice
from future if it is about to be preempted, so it could delay
preemption and complete any critical task it is in the middle of.

This feature improves performance for apps that use userspace locking
across large number of threads, for example large databases and Java,
and similar solutions have been used for many years on other OSs.
This feature helps in situation where a task acquires a lock before
performing a critical operation on shared data and happens to have
acquired the lock just before its timeslice is up which means it gets
preempted before it completes its task. This lock being held causes
all other tasks that also acquire the same lock to perform their
critical operation on shared data, to start queueing up and causing
large number of context switches. This queueing problem can be avoided
if the task that acquires lock first could request scheduler to let it
borrow one timeslice once it enters its critical section and hence
allow it to complete its critical section without causing queueing
problem. If critical section completes before the task is due for
preemption, the task can desassert its request which causes scheduler
to proceed with normal preemption. A task sends the scheduler this
request by setting a flag in a memory location it has shared with the
kernel. Kernel uses bytes in the same memory location to let the task
know when its request for amnesty from preemption has been granted.

These rules apply to the use of this feature:

- Request to borrow timeslice is not guranteed to be honored.
- If the task is allowed to borrow, kernel will inform the task
  of this. When this happens, task must yield the processor as soon
  as it completes its critical section.
- If the task fails to yield processor after being allowed to
  borrow, it is penalized by not honoring its next request for
  extra timeslice.
- Task is charged additional time for the borrowed timeslice as
  accumulated run time. This pushes it further down in consideration
  for the next task to run.

This feature was tested with a TPC-C workload. TPC-C workload shows
a 3% improvement in tpcc throughput when using this feature, which
is a significant improvement.

A new sysctl tunable kernel.preempt_delay_available enables this
feature at run time. The kernel boots up with this feature disabled
by default.

Documentation file included in this patch contains details on how to
use this feature, and conditions associated with its use. This patch
also adds a new field in scheduler statistics which keeps track of
how many times a task was granted amnesty from preemption.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---

With this version of this patch, the kernel will not enable
preemption delay by default. This feature must be turned on by using
sysctl tunable kernel.preempt_delay_available. With this change, there
are now two ways to eliminate the impact of this feature on systems
that do not intend to use it and are sensitive to scheduling delays
that may be caused by the use of this feature. This feature can be
configured out for custom built kernels. For pre-compiled kernels where
this feature may have been configured in, it will stay off until enabled
through sysctl tunable.

Changelog:
v4:
	- Added a shared data structure to define the memory location
	  used for requesting preemption delay.
	- Fixed a hole in the code that allowed preemption delay to
	  continue to happen if preemption delay feature was disabled
	  with sysctl after a task had already started using this.
	- Removed the restriction on setting location of shared data
	  structure only when one is not set currently.
	- Moved almost all conditionally compiled code into header files
	- Cleaned up config dependency for CONFIG_SCHED_PREEMPT_DELAY
	- Changed permission on preempt_delay_available sysctl to allow
	  users to read it.
	- Updated documentation file to match the code changes
v3:
	- Use prctl() syscall to give kernel the location for shared flag 
	  instead of using a proc file.
	- Disabled this feature by default on a newly booted kernel and
	  added a sysctl tunable to enable/disable it at runtime.
v2:
	- Replaced mmap operation with a more memory efficient futex
	  like communication between userspace and kernel
	- Added a flag to let userspace know if it was granted amnesty
	- Added a penalty for tasks failing to yield CPU when they
	  are granted amnesty from pre-emption

v1:
	- Initial RFC patch with mmap for communication between userspace
	  and kernel

 Documentation/scheduler/sched-preempt-delay.txt | 112 ++++++++++++++++++++
 arch/x86/Kconfig                                |  11 ++
 include/linux/sched.h                           |  38 +++++++
 include/linux/sched/sysctl.h                    |   4 +
 include/uapi/linux/prctl.h                      |   3 +
 include/uapi/linux/sched.h                      |   9 ++
 kernel/fork.c                                   |   2 +
 kernel/sched/core.c                             |   1 +
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             | 129 +++++++++++++++++++++++-
 kernel/sys.c                                    |   6 ++
 kernel/sysctl.c                                 |   9 ++
 12 files changed, 322 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/scheduler/sched-preempt-delay.txt

diff --git a/Documentation/scheduler/sched-preempt-delay.txt b/Documentation/scheduler/sched-preempt-delay.txt
new file mode 100644
index 0000000..4c9e111
--- /dev/null
+++ b/Documentation/scheduler/sched-preempt-delay.txt
@@ -0,0 +1,112 @@
+=================================
+What is preemption delay feature?
+=================================
+
+There are times when a userspace task is executing a critical section
+which gates a number of other tasks that want access to the same
+critical section. If the task holding the lock that guards this critical
+section happens to grab the lock just before its timeslice is up and is
+preempted by the scheduler, scheduler ends up scheduling other
+tasks which immediately try to grab the lock to enter the critical
+section. This only results in lots of context switches as tasks wake up
+and go to sleep immediately. If on the other hand, the original task
+were allowed to run for an extra timeslice, it could have completed
+executing its critical section allowing other tasks to make progress
+when they get scheduled. Preemption delay feature allows a task to
+request scheduler to let it borrow one extra timeslice, if possible.
+
+
+==================================
+Using the preemption delay feature
+==================================
+
+This feature is compiled in the kernel by setting
+CONFIG_SCHED_PREEMPT_DELAY in kernel configuration. By default, the
+kernel boots up with this feature disabled. Enable it using sysctl
+tunable kernel.preempt_delay_available. Once this feature is
+enabled, the userspace process communicates with the kernel using a
+4-byte memory location in its address space. This location must be
+aligned to 4-byte boundary. It first gives the kernel address for this
+memory location by making a prctl() system call with PR_SET_PREEMPT_DELAY
+option. This memory location is interpreted as the following data
+structure (defined in linux/sched.h):
+
+struct sched_delay_req {
+	unsigned char nopreempt;	/* flag to request preemption delay */
+	unsigned char yield;		/* flag from kernel indicating  */
+					/* preemption delay was granted */
+	unsigned char rsvd[2];		/* reserved */
+};
+
+Task requests a preemption delay by writing a non-zero value to the
+first byte - nopreempt. Scheduler checks this value before preempting
+the task. Scheduler can choose to grant one and only an additional
+time slice to the task for each delay request but this delay is not
+guaranteed. If scheduler does grant an additional timeslice, it will
+set the flag in second byte. Upon completion of the section of code
+where the task wants preemption delay, task should check the second byte.
+If the flag in second byte is set, it should clear this flag and call
+sched_yield() so as to not hog the processor. If a thread was granted
+additional timeslice and it fails to call sched_yield(), scheduler
+will penalize it by denying its next request for additional timeslice.
+Following sample code illustrates how to use this feature:
+
+#include <linux/sched.h>
+
+int main()
+{
+	unsigned char buf[256];
+	struct sched_delay_req delay;
+
+	bzero(&delay, sizeof(delay));
+
+	/* Tell kernel where the flag lives */
+	prctl(PR_SET_PREEMPT_DELAY, &delay);
+
+	while (/* some condition is true */) {
+		/* do some work and get ready to enter critical section */
+		delay.nopreempt = 1;
+		/*
+		 * Obtain lock for critical section
+		 */
+		/*
+		 * critical section
+		 */
+		/*
+		 * Release lock for critical section
+		 */
+		delay.nopreempt = 0;
+		/* Give the CPU up if required */
+		if (delay.yield) {
+			delay.yield = 0;
+			sched_yield();
+		}
+		/* do some more work */
+	}
+	/*
+	 * Tell kernel we are done asking for preemption delay
+	 */
+	prctl(PR_SET_PREEMPT_DELAY, NULL);
+}
+
+
+====================
+Scheduler statistics
+====================
+
+Preemption delay features adds a new field to scheduler statictics -
+nr_preempt_delayed. This is a per thread statistic that tracks the
+number of times a thread was granted amnesty from preemption when it
+requested for one. "cat /proc/<pid>/task/<tid>/sched" will list this
+number along with other scheduler statistics.
+
+
+=====
+Notes
+=====
+
+1. If the location of shared flag is not aligned to 4-byte boundary,
+   prctl() will terminate with EFAULT.
+2. Userspace app should zero out the sched_delay_req structure before
+   giving kernel the address of this structure. Stale data in this
+   structure could cause unintended requests for preemption delay.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 41a503c..6c24167 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -852,6 +852,17 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_PREEMPT_DELAY
+	def_bool n
+	prompt "Scheduler preemption delay support"
+	---help---
+	  Say Y here if you want to be able to delay scheduler preemption
+	  when possible by setting a flag in a memory location after
+	  sharing the address of this location with kernel using
+	  PR_SET_PREEMPT_DELAY prctl() call. See
+	  Documentation/scheduler/sched-preempt-delay.txt for details.
+	  If in doubt, say "N".
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..0b2f911 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,7 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+	u64			nr_preempt_delayed;
 };
 #endif
 
@@ -1232,6 +1233,14 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+struct preempt_delay {
+	struct sched_delay_req *delay_req;	/* delay request flag pointer */
+	unsigned char delay_granted;		/* currently in delay */
+	unsigned char yield_penalty;		/* failure to yield penalty */
+};
+#endif
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1324,6 +1333,9 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	struct preempt_delay sched_preempt_delay;
+#endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */
 
@@ -3031,4 +3043,30 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+static inline void task_init_preempt_delay(struct task_struct *p)
+{
+	memset(&p->sched_preempt_delay, 0, sizeof(struct preempt_delay));
+}
+static inline void task_clear_preempt_yield(struct task_struct *p)
+{
+	p->sched_preempt_delay.yield_penalty = 0;
+}
+extern int preempt_delay_write(struct task_struct *task,
+					unsigned long preempt_delay_addr);
+#define SCHED_SET_PREEMPT_DELAY(a)	preempt_delay_write(current, a)
+#define SCHED_GET_PREEMPT_DELAY(a)	\
+		put_user((unsigned long)current->sched_preempt_delay.delay_req,\
+				(unsigned long __user *)a)
+#else
+static inline void task_init_preempt_delay(struct task_struct *p)
+{
+}
+static inline void task_clear_preempt_yield(struct task_struct *p)
+{
+}
+#define SCHED_SET_PREEMPT_DELAY(a)	(-EINVAL)
+#define SCHED_GET_PREEMPT_DELAY(a)	(-EINVAL)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 #endif
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 596a0e0..516f74e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -107,4 +107,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+extern int sysctl_preempt_delay_available;
+#endif
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..ecfd2cd 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+#define PR_SET_PREEMPT_DELAY	43
+#define PR_GET_PREEMPT_DELAY	44
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b932be9..66a2f67 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -49,4 +49,13 @@
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
 
+/*
+ * struct for requesting preemption delay from scheduler
+ */
+struct sched_delay_req {
+	unsigned char nopreempt;
+	unsigned char yield;
+	unsigned char rsvd[2];
+};
+
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..aea655b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1672,6 +1672,8 @@ long do_fork(unsigned long clone_flags,
 			get_task_struct(p);
 		}
 
+		task_init_preempt_delay(p);
+
 		wake_up_new_task(p);
 
 		/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24beb9b..a926eea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4201,6 +4201,7 @@ SYSCALL_DEFINE0(sched_yield)
 {
 	struct rq *rq = this_rq_lock();
 
+	task_clear_preempt_yield(current);
 	schedstat_inc(rq, yld_count);
 	current->sched_class->yield_task(rq);
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..618d2ac 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -597,6 +597,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.nr_preempt_delayed);
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef2b104..a880c6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -428,6 +428,129 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+int sysctl_preempt_delay_available;
+
+int
+preempt_delay_write(struct task_struct *task, unsigned long preempt_delay_addr)
+{
+	/*
+	 * Do not allow write if preemption delay feature is disabled
+	 */
+	if (!sysctl_preempt_delay_available)
+		return -EPERM;
+
+	if ((void *)preempt_delay_addr == NULL) {
+		task->sched_preempt_delay.delay_req = NULL;
+		return 0;
+	}
+
+	/*
+	 * Validate the pointer. It should be naturally aligned
+	 */
+	if (unlikely((preempt_delay_addr % sizeof(u32)) != 0))
+		return -EFAULT;
+	if (unlikely(!access_ok(rw, preempt_delay_addr,
+					sizeof(struct sched_delay_req))))
+		return -EFAULT;
+
+	task->sched_preempt_delay.delay_req =
+				(struct sched_delay_req *) preempt_delay_addr;
+	return 0;
+}
+
+/*
+ * delay_resched_rq(): Check if the task about to be preempted has
+ *	requested an additional time slice. If it has, grant it additional
+ *	timeslice once.
+ */
+static void
+delay_resched_rq(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_entity *se;
+	struct sched_delay_req *delay_req, delay_flag;
+	int ret;
+
+	if (!sysctl_preempt_delay_available)
+		goto resched_now;
+
+	/*
+	 * Check if task is using pre-emption delay feature. If address
+	 * for preemption delay request flag is not set, this task is
+	 * not using preemption delay feature, we can reschedule without
+	 * any delay
+	 */
+	delay_req = curr->sched_preempt_delay.delay_req;
+	if (delay_req == NULL)
+		goto resched_now;
+
+	/*
+	 * Pre-emption delay will  be granted only once. If this task
+	 * has already been granted delay, rechedule now
+	 */
+	if (curr->sched_preempt_delay.delay_granted) {
+		curr->sched_preempt_delay.delay_granted = 0;
+		goto resched_now;
+	}
+
+	/*
+	 * Get the value of preemption delay request flag from userspace.
+	 * Task had already passed us the address where the flag is stored
+	 * in userspace earlier. If there is a page fault accessing this
+	 * flag in userspace, that means userspace has not touched this
+	 * flag recently and we can assume no preemption delay is needed.
+	 *
+	 * If task is not requesting additional timeslice, resched now
+	 */
+	pagefault_disable();
+	ret = __copy_from_user_inatomic(&delay_flag, delay_req,
+			sizeof(u32));
+	pagefault_enable();
+	if (ret || !delay_flag.nopreempt)
+		goto resched_now;
+
+	/*
+	 * Current thread has requested preemption delay and has not
+	 * been granted an extension yet. If this thread failed to yield
+	 * processor after being granted amnesty last time, penalize it
+	 * by not granting this delay request, otherwise give it an extra
+	 * timeslice.
+	 */
+	if (curr->sched_preempt_delay.yield_penalty) {
+		curr->sched_preempt_delay.yield_penalty = 0;
+		goto resched_now;
+	}
+
+	se = &curr->se;
+	curr->sched_preempt_delay.delay_granted = 1;
+
+	/*
+	 * Set the penalty flag for failing to yield the processor after
+	 * being granted immunity. This flag will be cleared in
+	 * sched_yield() if the thread indeed calls sched_yield
+	 */
+	curr->sched_preempt_delay.yield_penalty = 1;
+
+	/*
+	 * Let the thread know it got amnesty and it should call
+	 * sched_yield() when it is done to avoid penalty next time
+	 * it wants amnesty.
+	 */
+	delay_flag.nopreempt = 0;
+	delay_flag.yield = 1;
+	schedstat_inc(curr, se.statistics.nr_preempt_delayed);
+	__copy_to_user_inatomic(delay_req, &delay_flag, sizeof(u32));
+
+	return;
+
+resched_now:
+	resched_curr(rq);
+}
+#else
+#define delay_resched_rq(rq) resched_curr(rq)
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -2951,7 +3074,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
 	if (delta_exec > ideal_runtime) {
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
@@ -2975,7 +3098,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 
 	if (delta > ideal_runtime)
-		resched_curr(rq_of(cfs_rq));
+		delay_resched_rq(rq_of(cfs_rq));
 }
 
 static void
@@ -4792,7 +4915,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;
 
 preempt:
-	resched_curr(rq);
+	delay_resched_rq(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
 	 * on the rq. This can happen when a wakeup gets interleaved
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..a8b1eff 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2203,6 +2203,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			me->mm->def_flags &= ~VM_NOHUGEPAGE;
 		up_write(&me->mm->mmap_sem);
 		break;
+	case PR_SET_PREEMPT_DELAY:
+		error = SCHED_SET_PREEMPT_DELAY(arg2);
+		break;
+	case PR_GET_PREEMPT_DELAY:
+		error = SCHED_GET_PREEMPT_DELAY(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..c1cd344 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
+	{
+		.procname	= "preempt_delay_available",
+		.data		= &sysctl_preempt_delay_available,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-18 18:44 [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
@ 2014-12-18 22:28 ` Peter Zijlstra
  2014-12-18 22:42   ` Khalid Aziz
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-12-18 22:28 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: tglx, corbet, mingo, hpa, riel, akpm, rientjes, ak, mgorman,
	raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
> sched/fair: Add advisory flag for borrowing a timeslice

yuck hatred and much of that.

Also, you fail to explain why a kernel side spin futex-lock is not an
option.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-18 22:28 ` Peter Zijlstra
@ 2014-12-18 22:42   ` Khalid Aziz
  2014-12-18 23:02     ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-18 22:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, corbet, mingo, hpa, riel, akpm, rientjes, ak, mgorman,
	raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
> On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
>> sched/fair: Add advisory flag for borrowing a timeslice
>
> yuck hatred and much of that.
>
> Also, you fail to explain why a kernel side spin futex-lock is not an
> option.
>

I had explained that when this question came up in the last round of 
discussions. The queuing problem I am trying to solve stems from 
userspace locks. Java and databases implement their own userspace locks 
that do not use futex. Solving this with futex will not help the primary 
users of this solution.

Hope this helps.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-18 22:42   ` Khalid Aziz
@ 2014-12-18 23:02     ` Thomas Gleixner
  2014-12-18 23:38       ` Khalid Aziz
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2014-12-18 23:02 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Thu, 18 Dec 2014, Khalid Aziz wrote:
> On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
> > On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
> > > sched/fair: Add advisory flag for borrowing a timeslice
> > 
> > yuck hatred and much of that.
> > 
> > Also, you fail to explain why a kernel side spin futex-lock is not an
> > option.
> > 
> 
> I had explained that when this question came up in the last round of
> discussions. The queuing problem I am trying to solve stems from userspace
> locks. Java and databases implement their own userspace locks that do not use
> futex. Solving this with futex will not help the primary users of this
> solution.

If we can solve it with a proper designed and well thought out
functionality in the kernel based on a futex like mechanism, why cant
java and databases not switch over to that and simply use it?

You need to modify user space anyway, so it does not matter whether
you modify it in a sane or in a hacky way.

But its simpler to hack crap into the scheduler than coming up with a
proper solution to the problem, right?

> Hope this helps.

Not at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-18 23:02     ` Thomas Gleixner
@ 2014-12-18 23:38       ` Khalid Aziz
  2014-12-19  0:27         ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-18 23:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
> On Thu, 18 Dec 2014, Khalid Aziz wrote:
>> On 12/18/2014 03:28 PM, Peter Zijlstra wrote:
>>> On Thu, Dec 18, 2014 at 11:44:19AM -0700, Khalid Aziz wrote:
>>>> sched/fair: Add advisory flag for borrowing a timeslice
>>>
>>> yuck hatred and much of that.
>>>
>>> Also, you fail to explain why a kernel side spin futex-lock is not an
>>> option.
>>>
>>
>> I had explained that when this question came up in the last round of
>> discussions. The queuing problem I am trying to solve stems from userspace
>> locks. Java and databases implement their own userspace locks that do not use
>> futex. Solving this with futex will not help the primary users of this
>> solution.
>
> If we can solve it with a proper designed and well thought out
> functionality in the kernel based on a futex like mechanism, why cant
> java and databases not switch over to that and simply use it?
>
> You need to modify user space anyway, so it does not matter whether
> you modify it in a sane or in a hacky way.

Actually userspace does not need to be modified. The code to use this 
functionality is already present in database code since this same 
functionality exists on other OSs (the API is a little different but 
those details can be handled with a simple header file in userspace). 
Userspace code has already been tested and debugged thoroughly on the 
OSs that support this functionality and that has significant impact on 
testing effort. So for userspace it is simply a matter of turning that 
code on on Linux as well and recompiling. This would be a multi-platform 
solution for database/java as opposed to a Linux specific solution.

--
Khalid

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-18 23:38       ` Khalid Aziz
@ 2014-12-19  0:27         ` Thomas Gleixner
  2014-12-19 21:43           ` Khalid Aziz
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2014-12-19  0:27 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Thu, 18 Dec 2014, Khalid Aziz wrote:
> On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
> > If we can solve it with a proper designed and well thought out
> > functionality in the kernel based on a futex like mechanism, why cant
> > java and databases not switch over to that and simply use it?
> > 
> > You need to modify user space anyway, so it does not matter whether
> > you modify it in a sane or in a hacky way.
> 
> Actually userspace does not need to be modified. The code to use this
> functionality is already present in database code since this same
> functionality exists on other OSs (the API is a little different but those
> details can be handled with a simple header file in userspace). Userspace code
> has already been tested and debugged thoroughly on the OSs that support this
> functionality and that has significant impact on testing effort. So for
> userspace it is simply a matter of turning that code on on Linux as well and
> recompiling. This would be a multi-platform solution for database/java as
> opposed to a Linux specific solution.

Bullshit. If you turn that option on, it's a modification from the QA
point of view and you need to run a full validation no matter
what. Anything else is just QA by crystal ball.

Of course you carefully avoided (again) to answer the real question:

> But its simpler to hack crap into the scheduler than coming up with a
> proper solution to the problem, right?

I can answer it for you: Yes, it is simpler.

But as you might have figured out it's not really popular and therefor
not simpler to be accepted by the people who actually care about sane
designs. I can whip you up special purpose hacks for that which will
give you way more guarantees with way less lines of horrible code, but
that does not mean that such hacks are an acceptable solution. You can
carry those hacks in your private tree and ship it to your customers,
but do not expect that any sane maintainer will care about it.

Now the very same maintainers asked you several times to answer the
question why this can't be done with proper futex like spin
mechanisms, which would solve a bunch of related problems as well.

 You never even tried to answer that question simply because you never
 tried to think about it for real. Your only answer is that you want A
 because A is already used on other OSs and therefor solution B is not
 an option.

 But if solution B would gain 4% performance, then according to your
 previous argumentation it would become suddenly very interesting,
 right?

So unless you even show any sign of thinking about different
approaches and technically arguing why they cannot deliver the same
value you wont get anywhere with this and I can tell you why.

You create a new user space ABI

 That forces the kernel to support it forever, which in consequence
 imposes restrictions on the kernel scheduler forever.

 We have enough restrictions by misdesigned ABIs (e.g. sched_yield())
 already, so we really do not need more of that.

You ignore any request to prove why a proper designed spin futex
interface would not be a sensible solution for the problem.

 Of course you are free to ignore that (as you are free to ignore
 important review comments), but you don't have to be suprised when
 the responsible maintainers ignore any further attempt from you to
 get this merged.

Aside of that, you still fail to provide a proper test case which is
publically usable for the people involved in this to reproduce your 3%
gain and analyze the problem at hand properly. The provided:

      enable_hack();
      while (/*some condition */) {
      	    /* bla */
	    /* blub */
	    /* blurb */
	    /* yay! */
      }
      disable_hack();

is beyond useless.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-19  0:27         ` Thomas Gleixner
@ 2014-12-19 21:43           ` Khalid Aziz
  2014-12-19 23:57             ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-19 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 12/18/2014 05:27 PM, Thomas Gleixner wrote:
> On Thu, 18 Dec 2014, Khalid Aziz wrote:
>> On 12/18/2014 04:02 PM, Thomas Gleixner wrote:
>>> If we can solve it with a proper designed and well thought out
>>> functionality in the kernel based on a futex like mechanism, why cant
>>> java and databases not switch over to that and simply use it?
>>>
>>> You need to modify user space anyway, so it does not matter whether
>>> you modify it in a sane or in a hacky way.
>>
>> Actually userspace does not need to be modified. The code to use this
>> functionality is already present in database code since this same
>> functionality exists on other OSs (the API is a little different but those
>> details can be handled with a simple header file in userspace). Userspace code
>> has already been tested and debugged thoroughly on the OSs that support this
>> functionality and that has significant impact on testing effort. So for
>> userspace it is simply a matter of turning that code on on Linux as well and
>> recompiling. This would be a multi-platform solution for database/java as
>> opposed to a Linux specific solution.
>
> Bullshit. If you turn that option on, it's a modification from the QA
> point of view and you need to run a full validation no matter
> what. Anything else is just QA by crystal ball.
>
> Of course you carefully avoided (again) to answer the real question:
>
>> But its simpler to hack crap into the scheduler than coming up with a
>> proper solution to the problem, right?
>
> I can answer it for you: Yes, it is simpler.
>
> But as you might have figured out it's not really popular and therefor
> not simpler to be accepted by the people who actually care about sane
> designs. I can whip you up special purpose hacks for that which will
> give you way more guarantees with way less lines of horrible code, but
> that does not mean that such hacks are an acceptable solution. You can
> carry those hacks in your private tree and ship it to your customers,
> but do not expect that any sane maintainer will care about it.
>
> Now the very same maintainers asked you several times to answer the
> question why this can't be done with proper futex like spin
> mechanisms, which would solve a bunch of related problems as well.
>
>   You never even tried to answer that question simply because you never
>   tried to think about it for real. Your only answer is that you want A
>   because A is already used on other OSs and therefor solution B is not
>   an option.
>
>   But if solution B would gain 4% performance, then according to your
>   previous argumentation it would become suddenly very interesting,
>   right?
>
> So unless you even show any sign of thinking about different
> approaches and technically arguing why they cannot deliver the same
> value you wont get anywhere with this and I can tell you why.
>
> You create a new user space ABI
>
>   That forces the kernel to support it forever, which in consequence
>   imposes restrictions on the kernel scheduler forever.
>
>   We have enough restrictions by misdesigned ABIs (e.g. sched_yield())
>   already, so we really do not need more of that.
>
> You ignore any request to prove why a proper designed spin futex
> interface would not be a sensible solution for the problem.
>
>   Of course you are free to ignore that (as you are free to ignore
>   important review comments), but you don't have to be suprised when
>   the responsible maintainers ignore any further attempt from you to
>   get this merged.
>
> Aside of that, you still fail to provide a proper test case which is
> publically usable for the people involved in this to reproduce your 3%
> gain and analyze the problem at hand properly. The provided:
>
>        enable_hack();
>        while (/*some condition */) {
>        	    /* bla */
> 	    /* blub */
> 	    /* blurb */
> 	    /* yay! */
>        }
>        disable_hack();
>
> is beyond useless.
>
> Thanks,
>
> 	tglx
>

Fair enough. Implications of a new userspace ABI can be significant and 
I can accept not introducing a new one in the kernel.

The queuing problem caused by a task taking a contended lock just before 
its current timeslice is up which userspace app wouldn't know about, is 
a real problem nevertheless. My patch attempts to avoid the contention 
in the first place. futex with adaptive spinning is a post-contention 
solution that tries to minimize the cost of contention but does nothing 
to avoid the contention. Solving this problem using futex can help only 
if the userspace lock uses futex.

I have looked at solving this problem in userspace using priority 
inheritance semaphore but ran into many problems. I will go back and 
take another look at it.

I appreciate your feedback.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-19 21:43           ` Khalid Aziz
@ 2014-12-19 23:57             ` Thomas Gleixner
  2014-12-22 16:40               ` Khalid Aziz
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2014-12-19 23:57 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On Fri, 19 Dec 2014, Khalid Aziz wrote:
> The queuing problem caused by a task taking a contended lock just before its
> current timeslice is up which userspace app wouldn't know about, is a real
> problem nevertheless.

We know that already.

> My patch attempts to avoid the contention in the first
> place. futex with adaptive spinning is a post-contention solution that tries
> to minimize the cost of contention but does nothing to avoid the contention.

I never said that adaptive spinning can solve that problem. 

If you would have carefuly read what I wrote, you might have noticed,
that I said: 

     a proper futex like spin mechanism

Can you spot the subtle difference between that phrase and 'futex with
adaptive spinning'?

> Solving this problem using futex can help only if the userspace lock uses
> futex.

A really fundamentally new and earth shattering insight.

If you would spend your time to actually digest what maintainers are
telling you, we might make progress on that matter.

But you prefer to spend your time by repeating yourself and providing
completely useless information.

What you are missing completely here is that neither me nor other
maintainers involved care about how you spend your time. But we very
much care about the time WE waste with your behaviour.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-19 23:57             ` Thomas Gleixner
@ 2014-12-22 16:40               ` Khalid Aziz
  2014-12-23 10:52                   ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-22 16:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, corbet, mingo, hpa, riel, akpm, rientjes, ak,
	mgorman, raistlin, kirill.shutemov, atomlin, avagin, gorcunov,
	serge.hallyn, athorlton, oleg, vdavydov, daeseok.youn, keescook,
	yangds.fnst, sbauer, vishnu.ps, axboe, paulmck, linux-kernel,
	linux-doc, linux-api

On 12/19/2014 04:57 PM, Thomas Gleixner wrote:
> On Fri, 19 Dec 2014, Khalid Aziz wrote:
>> The queuing problem caused by a task taking a contended lock just before its
>> current timeslice is up which userspace app wouldn't know about, is a real
>> problem nevertheless.
>
> We know that already.
>
>> My patch attempts to avoid the contention in the first
>> place. futex with adaptive spinning is a post-contention solution that tries
>> to minimize the cost of contention but does nothing to avoid the contention.
>
> I never said that adaptive spinning can solve that problem.
>
> If you would have carefuly read what I wrote, you might have noticed,
> that I said:
>
>       a proper futex like spin mechanism
>
> Can you spot the subtle difference between that phrase and 'futex with
> adaptive spinning'?
>
>> Solving this problem using futex can help only if the userspace lock uses
>> futex.
>
> A really fundamentally new and earth shattering insight.
>
> If you would spend your time to actually digest what maintainers are
> telling you, we might make progress on that matter.
>
> But you prefer to spend your time by repeating yourself and providing
> completely useless information.
>
> What you are missing completely here is that neither me nor other
> maintainers involved care about how you spend your time. But we very
> much care about the time WE waste with your behaviour.

I am sorry that you feel the need to continue to resort to personal 
attacks even after I made it clear in my last response that I was not 
going to pursue this patch. There is no possibility of a productive 
discussion of a solution at this point. I hope someone else can find a 
solution you find acceptable.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
@ 2014-12-23 10:52                   ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2014-12-23 10:52 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Thomas Gleixner, Peter Zijlstra, corbet, mingo, hpa, riel, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api


* Khalid Aziz <khalid.aziz@oracle.com> wrote:

> On 12/19/2014 04:57 PM, Thomas Gleixner wrote:
> >On Fri, 19 Dec 2014, Khalid Aziz wrote:
> >>The queuing problem caused by a task taking a contended lock just before its
> >>current timeslice is up which userspace app wouldn't know about, is a real
> >>problem nevertheless.
> >
> >We know that already.
> >
> >>My patch attempts to avoid the contention in the first
> >>place. futex with adaptive spinning is a post-contention solution that tries
> >>to minimize the cost of contention but does nothing to avoid the contention.
> >
> >I never said that adaptive spinning can solve that problem.
> >
> >If you would have carefuly read what I wrote, you might have noticed,
> >that I said:
> >
> >      a proper futex like spin mechanism
> >
> >Can you spot the subtle difference between that phrase and 'futex with
> >adaptive spinning'?
> >
> >>Solving this problem using futex can help only if the userspace lock uses
> >>futex.
> >
> >A really fundamentally new and earth shattering insight.
> >
> >If you would spend your time to actually digest what maintainers are
> >telling you, we might make progress on that matter.
> >
> >But you prefer to spend your time by repeating yourself and providing
> >completely useless information.
> >
> >What you are missing completely here is that neither me nor other
> >maintainers involved care about how you spend your time. But we very
> >much care about the time WE waste with your behaviour.
> 
> I am sorry that you feel the need to continue to resort to 
> personal attacks [...]

Thomas did not attack your person AFAICS - he criticised your 
arguments with increasing volume, because he did not see you 
respond to his arguments in substance.

> even after I made it clear in my last response that I was not 
> going to pursue this patch. There is no possibility of a 
> productive discussion of a solution at this point. [...]

I think there is very much a possibility of a productive 
discussion:

> [...] I hope someone else can find a solution you find 
> acceptable.

to implement what Thomas suggested in the discussion: a proper 
futex like spin mechanism? That looks like a totally acceptable 
solution to me, without the disadvantages of your proposed 
solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
@ 2014-12-23 10:52                   ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2014-12-23 10:52 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Thomas Gleixner, Peter Zijlstra, corbet-T1hC0tSOHrs,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	riel-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	rientjes-hpIqsD4AKlfQT0dZR+AlfA, ak-VuQAYsv1563Yd54FQh9/CA,
	mgorman-l3A5Bk7waGM, raistlin-k2GhghHVRtY,
	kirill.shutemov-VuQAYsv1563Yd54FQh9/CA,
	atomlin-H+wXaHxf7aLQT0dZR+AlfA, avagin-GEFAQzZX7r8dnm+yROfE0A,
	gorcunov-GEFAQzZX7r8dnm+yROfE0A,
	serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw, athorlton-sJ/iWh9BUns,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	daeseok.youn-Re5JQEeQqe8AvxtiuMwx3w,
	keescook-F7+t8E8rja9g9hUCZPvPmw,
	yangds.fnst-BthXqXjhjHXQFUHtdCDX3A,
	sbauer-F61uvSdQLzf2fBVCVOL8/A, vishnu.ps-Sze3O3UU22JBDgjK7y7TUQ,
	axboe-b10kYP2dOMg, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA


* Khalid Aziz <khalid.aziz-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> On 12/19/2014 04:57 PM, Thomas Gleixner wrote:
> >On Fri, 19 Dec 2014, Khalid Aziz wrote:
> >>The queuing problem caused by a task taking a contended lock just before its
> >>current timeslice is up which userspace app wouldn't know about, is a real
> >>problem nevertheless.
> >
> >We know that already.
> >
> >>My patch attempts to avoid the contention in the first
> >>place. futex with adaptive spinning is a post-contention solution that tries
> >>to minimize the cost of contention but does nothing to avoid the contention.
> >
> >I never said that adaptive spinning can solve that problem.
> >
> >If you would have carefuly read what I wrote, you might have noticed,
> >that I said:
> >
> >      a proper futex like spin mechanism
> >
> >Can you spot the subtle difference between that phrase and 'futex with
> >adaptive spinning'?
> >
> >>Solving this problem using futex can help only if the userspace lock uses
> >>futex.
> >
> >A really fundamentally new and earth shattering insight.
> >
> >If you would spend your time to actually digest what maintainers are
> >telling you, we might make progress on that matter.
> >
> >But you prefer to spend your time by repeating yourself and providing
> >completely useless information.
> >
> >What you are missing completely here is that neither me nor other
> >maintainers involved care about how you spend your time. But we very
> >much care about the time WE waste with your behaviour.
> 
> I am sorry that you feel the need to continue to resort to 
> personal attacks [...]

Thomas did not attack your person AFAICS - he criticised your 
arguments with increasing volume, because he did not see you 
respond to his arguments in substance.

> even after I made it clear in my last response that I was not 
> going to pursue this patch. There is no possibility of a 
> productive discussion of a solution at this point. [...]

I think there is very much a possibility of a productive 
discussion:

> [...] I hope someone else can find a solution you find 
> acceptable.

to implement what Thomas suggested in the discussion: a proper 
futex like spin mechanism? That looks like a totally acceptable 
solution to me, without the disadvantages of your proposed 
solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-23 10:52                   ` Ingo Molnar
  (?)
@ 2014-12-23 15:13                   ` Khalid Aziz
  2014-12-23 18:46                     ` Rik van Riel
  -1 siblings, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-23 15:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, corbet, mingo, hpa, riel, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

On 12/23/2014 03:52 AM, Ingo Molnar wrote:
>
>
> to implement what Thomas suggested in the discussion: a proper
> futex like spin mechanism? That looks like a totally acceptable
> solution to me, without the disadvantages of your proposed
> solution.

Hi Ingo,

Thank you for taking the time to respond. It is indeed possible to 
implement a futex like spin mechanism. Futex like mechanism will be 
clean and elegant. That is where I had started when I was given this 
problem to solve. Trouble I run into is the primary application I am 
looking at to help with this solution is Database which implements its 
own locking mechanism without using POSIX semaphore or futex. Since the 
locking is entirely in userspace, kernel has no clue when the userspace 
has acquired one of these locks. So I can see only two ways to solve 
this - find a solution in userspace entirely, or have userspace tell the 
kernel when it acquires one of these locks. I will spend more time on 
finding a way to solve it in userspace and see if I can find a way to 
leverage futex mechanism without causing significant change to database 
code. There may be a way to use priority inheritance to avoid 
contention. Database performance people tell me that their testing has 
shown the cost of making any system calls in this code easily offsets 
any gains from optimizing for contention avoidance, so that is one big 
challenge. Database rewriting their locking code is extremely unlikely 
scenario. Am I missing a third option here?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-23 15:13                   ` Khalid Aziz
@ 2014-12-23 18:46                     ` Rik van Riel
  2014-12-23 20:47                       ` Khalid Aziz
  2015-01-13 11:25                       ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Rik van Riel @ 2014-12-23 18:46 UTC (permalink / raw)
  To: Khalid Aziz, Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, corbet, mingo, hpa, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/23/2014 10:13 AM, Khalid Aziz wrote:
> On 12/23/2014 03:52 AM, Ingo Molnar wrote:
>> 
>> 
>> to implement what Thomas suggested in the discussion: a proper 
>> futex like spin mechanism? That looks like a totally acceptable 
>> solution to me, without the disadvantages of your proposed 
>> solution.
> 
> Hi Ingo,
> 
> Thank you for taking the time to respond. It is indeed possible to 
> implement a futex like spin mechanism. Futex like mechanism will
> be clean and elegant. That is where I had started when I was given
> this problem to solve. Trouble I run into is the primary
> application I am looking at to help with this solution is Database
> which implements its own locking mechanism without using POSIX
> semaphore or futex. Since the locking is entirely in userspace,
> kernel has no clue when the userspace has acquired one of these
> locks. So I can see only two ways to solve this - find a solution
> in userspace entirely, or have userspace tell the kernel when it
> acquires one of these locks. I will spend more time on finding a
> way to solve it in userspace and see if I can find a way to 
> leverage futex mechanism without causing significant change to
> database code. There may be a way to use priority inheritance to
> avoid contention. Database performance people tell me that their
> testing has shown the cost of making any system calls in this code
> easily offsets any gains from optimizing for contention avoidance,
> so that is one big challenge. Database rewriting their locking code
> is extremely unlikely scenario. Am I missing a third option here?

An uncontended futex is taken without ever going into kernel
space. Adaptive spinning allows short duration futexes to be
taken without going into kernel space.

Only long held locks cause a thread to go into kernel space,
where it goes to sleep, freeing up the cpu, and increasing
the chance that the lock holder will run.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUmbihAAoJEM553pKExN6DDlQH/1vvy9YYuP2dCAZSU3fz855e
pj4796Qja929I2dStsbLl6Qhcg2ELtwtPkLoAePQ/4j2l7DCYgSNLXlC+RzQ32ay
rbMIfwiriEVGp2hsvYTOCpnur19IHf7v726ivaDXVOM/nrRaHsB8wwspLQQyfSIE
b7M7jxvT4S2pEELOGB6JQfEZZhbf5wBv9HBk+fkCBMaO4WZrnYczyD0/omiADm65
xSm/8pCMK22u8Tzn9EpKpIVdIFrl9AlZ1uiRBV2Br1oqwaBTvJVknW4bvIk0DWZU
ErwR/073UYKpl+xce3nbnixH8FeRP7/mq73Xd8e+iCgn6Dtzr1tANsu27EigMZ0=
=WHb3
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-23 18:46                     ` Rik van Riel
@ 2014-12-23 20:47                       ` Khalid Aziz
  2014-12-23 22:33                           ` Rik van Riel
  2015-01-13 11:25                       ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Khalid Aziz @ 2014-12-23 20:47 UTC (permalink / raw)
  To: Rik van Riel, Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, corbet, mingo, hpa, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

On 12/23/2014 11:46 AM, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 12/23/2014 10:13 AM, Khalid Aziz wrote:
>> On 12/23/2014 03:52 AM, Ingo Molnar wrote:
>>>
>>>
>>> to implement what Thomas suggested in the discussion: a proper
>>> futex like spin mechanism? That looks like a totally acceptable
>>> solution to me, without the disadvantages of your proposed
>>> solution.
>>
>> Hi Ingo,
>>
>> Thank you for taking the time to respond. It is indeed possible to
>> implement a futex like spin mechanism. Futex like mechanism will
>> be clean and elegant. That is where I had started when I was given
>> this problem to solve. Trouble I run into is the primary
>> application I am looking at to help with this solution is Database
>> which implements its own locking mechanism without using POSIX
>> semaphore or futex. Since the locking is entirely in userspace,
>> kernel has no clue when the userspace has acquired one of these
>> locks. So I can see only two ways to solve this - find a solution
>> in userspace entirely, or have userspace tell the kernel when it
>> acquires one of these locks. I will spend more time on finding a
>> way to solve it in userspace and see if I can find a way to
>> leverage futex mechanism without causing significant change to
>> database code. There may be a way to use priority inheritance to
>> avoid contention. Database performance people tell me that their
>> testing has shown the cost of making any system calls in this code
>> easily offsets any gains from optimizing for contention avoidance,
>> so that is one big challenge. Database rewriting their locking code
>> is extremely unlikely scenario. Am I missing a third option here?
>
> An uncontended futex is taken without ever going into kernel
> space. Adaptive spinning allows short duration futexes to be
> taken without going into kernel space.

You are right. Uncontended futex is very fast since it never goes into 
kernel. Queuing problem happens when the lock holder has been 
pre-empted. Adaptive spinning does the smart thing os spin-waiting only 
if the lock holder is still running on another core. If lock holder is 
not scheduled on any core, even adaptive spinning has to go into the 
kernel to be put on wait queue. What would avoid queuing problem and 
reduce the cost of contention is a combination of adaptive spinning, and 
a way to keep the lock holder running on one of the cores just a little 
longer so it can release the lock. Without creating special case and a 
new API in kernel, one way I can think of accomplishing the second part 
is to boost the priority of lock holder when contention happens and 
priority ceiling is meant to do exactly that. Priority ceiling 
implementation in glibc boosts the priority by calling into scheduler 
which does incur the cost of a system call. Priority boost is a reliable 
solution that does not change scheduling semantics. The solution 
allowing lock holder to use one extra timeslice is not a definitive 
solution but tpcc workload shows it does work and it works without 
requiring changes to database locking code.

Theoretically a new locking library that uses both these techniques will 
help solve the problem but being a new locking library, there is a big 
unknown of what new problems, performance and otherwise, it will bring 
and database has to recode to this new library. Nevertheless this is the 
path I am exploring now. The challenge being how to do this without 
requiring changes to database code or the kernel. The hooks available to 
me into current database code are schedctl_init(), schedctl_start() and 
schedctl_stop() which are no-op on Linux at this time. Database folks 
can replace these no-ops with real code in their library to solve the 
queuing problem. schedctl_start() and schedctl_stop() are called only 
when one of the highly contended locks is acquired or released. 
schedctl_start() is called after the lock has been acquired which means 
I can not rely upon it to solve contention issue. schedctl_stop() is 
called after the lock has been released.

Thanks,
Khalid

>
> Only long held locks cause a thread to go into kernel space,
> where it goes to sleep, freeing up the cpu, and increasing
> the chance that the lock holder will run.
>
> - --
> All rights reversed
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUmbihAAoJEM553pKExN6DDlQH/1vvy9YYuP2dCAZSU3fz855e
> pj4796Qja929I2dStsbLl6Qhcg2ELtwtPkLoAePQ/4j2l7DCYgSNLXlC+RzQ32ay
> rbMIfwiriEVGp2hsvYTOCpnur19IHf7v726ivaDXVOM/nrRaHsB8wwspLQQyfSIE
> b7M7jxvT4S2pEELOGB6JQfEZZhbf5wBv9HBk+fkCBMaO4WZrnYczyD0/omiADm65
> xSm/8pCMK22u8Tzn9EpKpIVdIFrl9AlZ1uiRBV2Br1oqwaBTvJVknW4bvIk0DWZU
> ErwR/073UYKpl+xce3nbnixH8FeRP7/mq73Xd8e+iCgn6Dtzr1tANsu27EigMZ0=
> =WHb3
> -----END PGP SIGNATURE-----
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
@ 2014-12-23 22:33                           ` Rik van Riel
  0 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2014-12-23 22:33 UTC (permalink / raw)
  To: Khalid Aziz, Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, corbet, mingo, hpa, akpm,
	rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/23/2014 03:47 PM, Khalid Aziz wrote:

>> You are right. Uncontended futex is very fast since it never goes
>> into kernel. Queuing problem happens when the lock holder has
>> been pre-empted. Adaptive spinning does the smart thing os
>> spin-waiting only if the lock holder is still running on another
>> core. If lock holder is not scheduled on any core, even adaptive
>> spinning has to go into the kernel to be put on wait queue. What
>> would avoid queuing problem and reduce the cost of contention is
>> a combination of adaptive spinning, and a way to keep the lock
>> holder running on one of the cores just a little longer so it can
>> release the lock. Without creating special case and a new API in
>> kernel, one way I can think of accomplishing the second part is
>> to boost the priority of lock holder when contention happens and 
>> priority ceiling is meant to do exactly that. Priority ceiling 
>> implementation in glibc boosts the priority by calling into
>> scheduler which does incur the cost of a system call. Priority
>> boost is a reliable solution that does not change scheduling
>> semantics. The solution allowing lock holder to use one extra
>> timeslice is not a definitive solution but tpcc workload shows it
>> does work and it works without requiring changes to database
>> locking code.
> 
>> Theoretically a new locking library that uses both these
>> techniques will help solve the problem but being a new locking
>> library, there is a big unknown of what new problems, performance
>> and otherwise, it will bring and database has to recode to this
>> new library. Nevertheless this is the path I am exploring now.
>> The challenge being how to do this without requiring changes to
>> database code or the kernel. The hooks available to me into
>> current database code are schedctl_init(), schedctl_start() and 
>> schedctl_stop() which are no-op on Linux at this time.

That sounds like a feature.  Keep the uncontended operations fast
by not doing anything, and only slow down when there is contention.

Presumably the database people will optimize their code to avoid
contention, so any complexity can happen in the slow path, instead
of by adding things to the fast path...

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUme3NAAoJEM553pKExN6DD8gH/3am5Izrobk/AiN8sijg3YXA
a9orVuoWNE+BLt49PwWrYpjsR2AgN4G3BbUrb4GVhaFBL5/v/frUhk0As3w3uM21
QjxMtaFvqZviLWCFgtIna7zSxHom+v/eRiAjLtCoX+GtHs+t25Jyf1GowmZnkoNd
UtDPHPXmyA2CqZC0E9d53Uzb9XaP/T4G3J8U2aPSvwoj4Nw85H2S/QMptNQEJDjY
0Qpx/fv2Ze/gJ7GujU3gloX6cH5DDU+p9/pFZ7iDEB6jbbb384Zuacq6R6CeJMVB
EAxKW1tpFtPvaRC51x8TFNJY5FxSISbXKbehxKjXQ8rlkcM/k1euzo2KCKOp68w=
=cTlU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
@ 2014-12-23 22:33                           ` Rik van Riel
  0 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2014-12-23 22:33 UTC (permalink / raw)
  To: Khalid Aziz, Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, corbet-T1hC0tSOHrs,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	rientjes-hpIqsD4AKlfQT0dZR+AlfA, ak-VuQAYsv1563Yd54FQh9/CA,
	mgorman-l3A5Bk7waGM, raistlin-k2GhghHVRtY,
	kirill.shutemov-VuQAYsv1563Yd54FQh9/CA,
	atomlin-H+wXaHxf7aLQT0dZR+AlfA, avagin-GEFAQzZX7r8dnm+yROfE0A,
	gorcunov-GEFAQzZX7r8dnm+yROfE0A,
	serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw, athorlton-sJ/iWh9BUns,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	daeseok.youn-Re5JQEeQqe8AvxtiuMwx3w,
	keescook-F7+t8E8rja9g9hUCZPvPmw,
	yangds.fnst-BthXqXjhjHXQFUHtdCDX3A,
	sbauer-F61uvSdQLzf2fBVCVOL8/A, vishnu.ps-Sze3O3UU22JBDgjK7y7TUQ,
	axboe-b10kYP2dOMg, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/23/2014 03:47 PM, Khalid Aziz wrote:

>> You are right. Uncontended futex is very fast since it never goes
>> into kernel. Queuing problem happens when the lock holder has
>> been pre-empted. Adaptive spinning does the smart thing os
>> spin-waiting only if the lock holder is still running on another
>> core. If lock holder is not scheduled on any core, even adaptive
>> spinning has to go into the kernel to be put on wait queue. What
>> would avoid queuing problem and reduce the cost of contention is
>> a combination of adaptive spinning, and a way to keep the lock
>> holder running on one of the cores just a little longer so it can
>> release the lock. Without creating special case and a new API in
>> kernel, one way I can think of accomplishing the second part is
>> to boost the priority of lock holder when contention happens and 
>> priority ceiling is meant to do exactly that. Priority ceiling 
>> implementation in glibc boosts the priority by calling into
>> scheduler which does incur the cost of a system call. Priority
>> boost is a reliable solution that does not change scheduling
>> semantics. The solution allowing lock holder to use one extra
>> timeslice is not a definitive solution but tpcc workload shows it
>> does work and it works without requiring changes to database
>> locking code.
> 
>> Theoretically a new locking library that uses both these
>> techniques will help solve the problem but being a new locking
>> library, there is a big unknown of what new problems, performance
>> and otherwise, it will bring and database has to recode to this
>> new library. Nevertheless this is the path I am exploring now.
>> The challenge being how to do this without requiring changes to
>> database code or the kernel. The hooks available to me into
>> current database code are schedctl_init(), schedctl_start() and 
>> schedctl_stop() which are no-op on Linux at this time.

That sounds like a feature.  Keep the uncontended operations fast
by not doing anything, and only slow down when there is contention.

Presumably the database people will optimize their code to avoid
contention, so any complexity can happen in the slow path, instead
of by adding things to the fast path...

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUme3NAAoJEM553pKExN6DD8gH/3am5Izrobk/AiN8sijg3YXA
a9orVuoWNE+BLt49PwWrYpjsR2AgN4G3BbUrb4GVhaFBL5/v/frUhk0As3w3uM21
QjxMtaFvqZviLWCFgtIna7zSxHom+v/eRiAjLtCoX+GtHs+t25Jyf1GowmZnkoNd
UtDPHPXmyA2CqZC0E9d53Uzb9XaP/T4G3J8U2aPSvwoj4Nw85H2S/QMptNQEJDjY
0Qpx/fv2Ze/gJ7GujU3gloX6cH5DDU+p9/pFZ7iDEB6jbbb384Zuacq6R6CeJMVB
EAxKW1tpFtPvaRC51x8TFNJY5FxSISbXKbehxKjXQ8rlkcM/k1euzo2KCKOp68w=
=cTlU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
  2014-12-23 18:46                     ` Rik van Riel
  2014-12-23 20:47                       ` Khalid Aziz
@ 2015-01-13 11:25                       ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2015-01-13 11:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Khalid Aziz, Ingo Molnar, Thomas Gleixner, corbet, mingo, hpa,
	akpm, rientjes, ak, mgorman, raistlin, kirill.shutemov, atomlin,
	avagin, gorcunov, serge.hallyn, athorlton, oleg, vdavydov,
	daeseok.youn, keescook, yangds.fnst, sbauer, vishnu.ps, axboe,
	paulmck, linux-kernel, linux-doc, linux-api

On Tue, Dec 23, 2014 at 01:46:58PM -0500, Rik van Riel wrote:
> An uncontended futex is taken without ever going into kernel
> space. Adaptive spinning allows short duration futexes to be
> taken without going into kernel space.

The going into kernel is a red herring afaict, a no-op syscall costs ~180
cycles or something like that. So sure you can do a wee spin (~90 cycles
to try and amortize that), but I really doubt that this is the problem.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-01-13 11:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-18 18:44 [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice Khalid Aziz
2014-12-18 22:28 ` Peter Zijlstra
2014-12-18 22:42   ` Khalid Aziz
2014-12-18 23:02     ` Thomas Gleixner
2014-12-18 23:38       ` Khalid Aziz
2014-12-19  0:27         ` Thomas Gleixner
2014-12-19 21:43           ` Khalid Aziz
2014-12-19 23:57             ` Thomas Gleixner
2014-12-22 16:40               ` Khalid Aziz
2014-12-23 10:52                 ` Ingo Molnar
2014-12-23 10:52                   ` Ingo Molnar
2014-12-23 15:13                   ` Khalid Aziz
2014-12-23 18:46                     ` Rik van Riel
2014-12-23 20:47                       ` Khalid Aziz
2014-12-23 22:33                         ` Rik van Riel
2014-12-23 22:33                           ` Rik van Riel
2015-01-13 11:25                       ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.