* [RFC PATCH 0/3] Scheduler Soft Affinity
@ 2019-06-26 22:47 subhra mazumdar
  2019-06-26 22:47 ` [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity subhra mazumdar
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: subhra mazumdar @ 2019-06-26 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

When multiple instances of workloads are consolidated on the same host it
is good practice to partition them for best performance, e.g. to give a
NUMA node partition to each instance. Currently the Linux kernel provides
two interfaces for hard partitioning: the sched_setaffinity system call and
the cpuset.cpus cgroup. But neither allows one instance to burst out of its
partition and use available CPUs from other partitions when they are idle.
Running all instances free range without any affinity, on the other hand,
suffers from cache coherence overhead across sockets (NUMA nodes) when all
instances are busy.

To achieve the best of both worlds, one potential way is to use the
AutoNUMA balancer, which migrates memory and threads to align them on the
same NUMA node when all instances are busy. But it doesn't work if memory
is spread across NUMA nodes, it has a high reaction time due to its
periodic scanning mechanism, and it can't handle sub-NUMA levels. Some
motivational experiments were done with 2 DB instances running on a
2-socket x86 Intel system with 22 cores per socket. 'numactl -m' was used
to bind the memory of each instance to one NUMA node; the idea was to have
AutoNUMA migrate only threads, as memory is pinned. But AutoNUMA ON vs OFF
didn't make any difference in performance. It was also found that AutoNUMA
still migrated pages across NUMA nodes, so numactl only controls the
initial allocation of memory and AutoNUMA is free to migrate the pages
later. Following are the vmstats for different numbers of users running
TPC-C in each DB instance with numactl and AutoNUMA ON. With AutoNUMA OFF
the page migrations were of course zero.

users                   2x16      2x24      2x32
numa_hint_faults        1672485   2267425   1916625
numa_hint_faults_local  1158283   1548501   1499772
numa_pages_migrated     373670    586473    229581

Given the above drawbacks, the most logical way to achieve the desired
behavior is via the task scheduler. A new interface is added, a new system
call sched_setaffinity2 in this case, to specify the set of soft affinity
CPUs. It takes an extra parameter to specify hard or soft affinity, where
hard behaves the same as the existing sched_setaffinity. I am open to using
other interfaces like cgroup or anything else I might not have considered.
Also, this patchset only allows it for CFS class threads, as for the RT
class the preferential search latency may not be tolerated; nothing in
theory, however, stops us from implementing soft affinity for the RT class
too. Finally, it also adds new scheduler tunables to tune the "softness" of
soft affinity, as different workloads may have different optimal points.
This is done using two tunables, sched_allowed and sched_preferred: if the
ratio of CPU utilization of the preferred set to the allowed set crosses
the ratio sched_allowed:sched_preferred, the scheduler will use the entire
allowed set instead of the preferred set in the first level of search in
select_task_rq_fair. The default value is 100:1.
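
For illustration, the following is a minimal userspace sketch of setting a
task's soft affinity to the CPUs of one socket of the test system. It
assumes the x86-64 syscall number 434 and the affinity flag values added in
patch 1; there is no glibc wrapper, so syscall(2) is used directly:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_sched_setaffinity2	434	/* x86-64 number from patch 1 */
#define SCHED_HARD_AFFINITY	0
#define SCHED_SOFT_AFFINITY	1

int main(void)
{
	cpu_set_t mask;
	int cpu;

	/* Prefer CPUs 0-21 (socket 0), but allow bursting onto other CPUs. */
	CPU_ZERO(&mask);
	for (cpu = 0; cpu < 22; cpu++)
		CPU_SET(cpu, &mask);

	if (syscall(__NR_sched_setaffinity2, getpid(), sizeof(mask), &mask,
		    SCHED_SOFT_AFFINITY)) {
		perror("sched_setaffinity2");
		return 1;
	}
	return 0;
}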

Following are the performance results of running 2 and 1 instance(s) of
Hackbench and Oracle DB on a 2-socket Intel x86 system with 22 cores per
socket, with the default tunable settings. For the 2 instance case the DB
shows a substantial improvement over no affinity, but Hackbench shows a
negligible improvement. For the 1 instance case DB performance is close to
no affinity, but Hackbench has a significant regression. Hard affinity
numbers are also added for comparison. The load in each Hackbench and DB
instance is varied by varying the number of groups and users respectively.
%gain is w.r.t. no affinity.

(100:1)
Hackbench   %gain with soft affinity        %gain with hard affinity
2*4	    1.12			    1.3
2*8	    1.67			    1.35
2*16	    1.3				    1.12
2*32        0.31			    0.61
1*4	    -18				    -58
1*8	    -24				    -59
1*16	    -33				    -72
1*32	    -38				    -83

DB	    %gain with soft affinity	    %gain with hard affinity
2*16	    4.26			    4.1
2*24	    5.43			    7.17
2*32	    6.31			    6.22
1*16	    -0.02			    0.62
1*24	    0.12			    -26.1
1*32	    -1.48			    -8.65

The experiments were repeated with sched_allowed:sched_preferred set to
5:4 to have "softer" soft affinity. The following numbers show it preserves
the (negligible) improvement for the 2 instance Hackbench case but reduces
the regression for 1 instance significantly. For the DB this setting
doesn't work well, as the improvements for the 2 instance case go away.
This also shows that different workloads have different optimal settings.

(5:4)
Hackbench   %gain with soft affinity
2*4	    1.43
2*8	    1.36
2*16	    1.01
2*32	    1.45
1*4	    -2.55
1*8	    -5.06
1*16	    -8
1*32	    -7.32
                                                                           
DB          %gain with soft affinity
2*16	    0.46
2*24	    3.68
2*32	    -3.34
1*16	    0.08
1*24	    1.6
1*32	    -1.29

Finally, I measured the overhead of soft affinity when it is NOT used by
comparing the soft affinity kernel with the baseline kernel for the no
affinity and hard affinity cases with Hackbench. The following is the
improvement of the soft affinity kernel w.r.t. the baseline, but the
numbers are really within the noise margin. This shows soft affinity has no
overhead when not used.

Hackbench   %diff of no affinity	%diff of hard affinity
2*4	    0.11			0.31
2*8	    0.13			0.55
2*16	    0.61			0.90
2*32	    0.86			1.01
1*4	    0.48			0.43
1*8	    0.45			0.33
1*16	    0.61			0.64
1*32	    0.11			0.63

A final set of experiments was done (numbers not shown) with the memory of
each DB instance spread evenly across both NUMA nodes. This showed similar
improvements with soft affinity for the 2 instance case, indicating that
the improvement comes from saving LLC coherence overhead.

subhra mazumdar (3):
  sched: Introduce new interface for scheduler soft affinity
  sched: change scheduler to give preference to soft affinity CPUs
  sched: introduce tunables to control soft affinity

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h                  |   5 +-
 include/linux/sched/sysctl.h           |   2 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/sched.h             |   3 +
 init/init_task.c                       |   2 +
 kernel/compat.c                        |   2 +-
 kernel/rcu/tree_plugin.h               |   3 +-
 kernel/sched/core.c                    | 167 ++++++++++++++++++++++++++++-----
 kernel/sched/fair.c                    | 154 ++++++++++++++++++++++--------
 kernel/sched/sched.h                   |   2 +
 kernel/sysctl.c                        |  14 +++
 13 files changed, 297 insertions(+), 65 deletions(-)

-- 
2.9.3



* [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity
  2019-06-26 22:47 [RFC PATCH 0/3] Scheduler Soft Affinity subhra mazumdar
@ 2019-06-26 22:47 ` subhra mazumdar
  2019-07-02 16:23   ` Peter Zijlstra
  2019-07-02 16:29   ` Peter Zijlstra
  2019-06-26 22:47 ` [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs subhra mazumdar
  2019-06-26 22:47 ` [RFC PATCH 3/3] sched: introduce tunables to control soft affinity subhra mazumdar
  2 siblings, 2 replies; 12+ messages in thread
From: subhra mazumdar @ 2019-06-26 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

A new system call, sched_setaffinity2, is introduced for scheduler soft
affinity. It takes an extra parameter to specify hard or soft affinity,
where hard behaves the same as the existing sched_setaffinity. A new
cpumask, cpus_preferred, is introduced for this purpose and is always a
subset of cpus_allowed. A boolean, affinity_unequal, stores whether the two
masks are unequal, for fast lookup. Setting hard affinity resets the soft
affinity set to be equal to it. Soft affinity is only allowed for CFS class
threads.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h                  |   5 +-
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/sched.h             |   3 +
 init/init_task.c                       |   2 +
 kernel/compat.c                        |   2 +-
 kernel/rcu/tree_plugin.h               |   3 +-
 kernel/sched/core.c                    | 167 ++++++++++++++++++++++++++++-----
 9 files changed, 162 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e..1dccdd2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+434	common	sched_setaffinity2	__x64_sys_sched_setaffinity2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b863fa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,8 @@ struct task_struct {
 	unsigned int			policy;
 	int				nr_cpus_allowed;
 	cpumask_t			cpus_allowed;
+	cpumask_t			cpus_preferred;
+	bool				affinity_unequal;
 
 #ifdef CONFIG_PREEMPT_RCU
 	int				rcu_read_lock_nesting;
@@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 # define vcpu_is_preempted(cpu)	false
 #endif
 
-extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
+extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask,
+			      int flags);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifndef TASK_SIZE_OF
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe..147a4e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
 				struct __kernel_timespec __user *interval);
 asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid,
 						 struct old_timespec32 __user *interval);
+asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len,
+				       unsigned long __user *user_mask_ptr,
+				       int flags);
 
 /* kernel/signal.c */
 asmlinkage long sys_restart_syscall(void);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904d..d77b366 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_sched_setaffinity2 434
+__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee17..f910cd5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -52,6 +52,9 @@
 #define SCHED_FLAG_RECLAIM		0x02
 #define SCHED_FLAG_DL_OVERRUN		0x04
 
+#define SCHED_HARD_AFFINITY	0
+#define SCHED_SOFT_AFFINITY	1
+
 #define SCHED_FLAG_ALL	(SCHED_FLAG_RESET_ON_FORK	| \
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN)
diff --git a/init/init_task.c b/init/init_task.c
index c70ef65..aa226a3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -73,6 +73,8 @@ struct task_struct init_task
 	.normal_prio	= MAX_PRIO - 20,
 	.policy		= SCHED_NORMAL,
 	.cpus_allowed	= CPU_MASK_ALL,
+	.cpus_preferred = CPU_MASK_ALL,
+	.affinity_unequal = false,
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
diff --git a/kernel/compat.c b/kernel/compat.c
index b5f7063..96621d7 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -226,7 +226,7 @@ COMPAT_SYSCALL_DEFINE3(sched_setaffinity, compat_pid_t, pid,
 	if (retval)
 		goto out;
 
-	retval = sched_setaffinity(pid, new_mask);
+	retval = sched_setaffinity(pid, new_mask, SCHED_HARD_AFFINITY);
 out:
 	free_cpumask_var(new_mask);
 	return retval;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1102765..bdff600 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2287,7 +2287,8 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
 void rcu_bind_current_to_nocb(void)
 {
 	if (cpumask_available(rcu_nocb_mask) && cpumask_weight(rcu_nocb_mask))
-		WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask));
+		WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask,
+					  SCHED_HARD_AFFINITY));
 }
 EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..eca3e98b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1060,6 +1060,12 @@ void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_ma
 	p->nr_cpus_allowed = cpumask_weight(new_mask);
 }
 
+void set_cpus_preferred_common(struct task_struct *p,
+			       const struct cpumask *new_mask)
+{
+	cpumask_copy(&p->cpus_preferred, new_mask);
+}
+
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
 	struct rq *rq = task_rq(p);
@@ -1082,6 +1088,37 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		put_prev_task(rq, p);
 
 	p->sched_class->set_cpus_allowed(p, new_mask);
+	set_cpus_preferred_common(p, new_mask);
+
+	if (queued)
+		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+	if (running)
+		set_curr_task(rq, p);
+}
+
+void do_set_cpus_preferred(struct task_struct *p,
+			   const struct cpumask *new_mask)
+{
+	struct rq *rq = task_rq(p);
+	bool queued, running;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	queued = task_on_rq_queued(p);
+	running = task_current(rq, p);
+
+	if (queued) {
+		/*
+		 * Because __kthread_bind() calls this on blocked tasks without
+		 * holding rq->lock.
+		 */
+		lockdep_assert_held(&rq->lock);
+		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+	}
+	if (running)
+		put_prev_task(rq, p);
+
+	set_cpus_preferred_common(p, new_mask);
 
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -1170,6 +1207,41 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	return ret;
 }
 
+static int
+__set_cpus_preferred_ptr(struct task_struct *p, const struct cpumask *new_mask)
+{
+	const struct cpumask *cpu_valid_mask = cpu_active_mask;
+	unsigned int dest_cpu;
+	struct rq_flags rf;
+	struct rq *rq;
+	int ret = 0;
+
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+
+	if (p->flags & PF_KTHREAD) {
+		/*
+		 * Kernel threads are allowed on online && !active CPUs
+		 */
+		cpu_valid_mask = cpu_online_mask;
+	}
+
+	if (cpumask_equal(&p->cpus_preferred, new_mask))
+		goto out;
+
+	if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	do_set_cpus_preferred(p, new_mask);
+
+out:
+	task_rq_unlock(rq, p, &rf);
+
+	return ret;
+}
+
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
 	return __set_cpus_allowed_ptr(p, new_mask, false);
@@ -4724,7 +4796,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	return retval;
 }
 
-long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask, int flags)
 {
 	cpumask_var_t cpus_allowed, new_mask;
 	struct task_struct *p;
@@ -4742,6 +4814,11 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	get_task_struct(p);
 	rcu_read_unlock();
 
+	if (flags == SCHED_SOFT_AFFINITY &&
+	    p->sched_class != &fair_sched_class) {
+		retval = -EINVAL;
+		goto out_put_task;
+	}
 	if (p->flags & PF_NO_SETAFFINITY) {
 		retval = -EINVAL;
 		goto out_put_task;
@@ -4790,18 +4867,37 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	}
 #endif
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, true);
-
-	if (!retval) {
-		cpuset_cpus_allowed(p, cpus_allowed);
-		if (!cpumask_subset(new_mask, cpus_allowed)) {
-			/*
-			 * We must have raced with a concurrent cpuset
-			 * update. Just reset the cpus_allowed to the
-			 * cpuset's cpus_allowed
-			 */
-			cpumask_copy(new_mask, cpus_allowed);
-			goto again;
+	if (flags == SCHED_HARD_AFFINITY) {
+		retval = __set_cpus_allowed_ptr(p, new_mask, true);
+
+		if (!retval) {
+			cpuset_cpus_allowed(p, cpus_allowed);
+			if (!cpumask_subset(new_mask, cpus_allowed)) {
+				/*
+				 * We must have raced with a concurrent cpuset
+				 * update. Just reset the cpus_allowed to the
+				 * cpuset's cpus_allowed
+				 */
+				cpumask_copy(new_mask, cpus_allowed);
+				goto again;
+			}
+			p->affinity_unequal = false;
+		}
+	} else if (flags == SCHED_SOFT_AFFINITY) {
+		retval = __set_cpus_preferred_ptr(p, new_mask);
+		if (!retval) {
+			cpuset_cpus_allowed(p, cpus_allowed);
+			if (!cpumask_subset(new_mask, cpus_allowed)) {
+				/*
+				 * We must have raced with a concurrent cpuset
+				 * update.
+				 */
+				cpumask_and(new_mask, new_mask, cpus_allowed);
+				goto again;
+			}
+			if (!cpumask_equal(&p->cpus_allowed,
+					   &p->cpus_preferred))
+				p->affinity_unequal = true;
 		}
 	}
 out_free_new_mask:
@@ -4824,30 +4920,53 @@ static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
 	return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
 }
 
-/**
- * sys_sched_setaffinity - set the CPU affinity of a process
- * @pid: pid of the process
- * @len: length in bytes of the bitmask pointed to by user_mask_ptr
- * @user_mask_ptr: user-space pointer to the new CPU mask
- *
- * Return: 0 on success. An error code otherwise.
- */
-SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
-		unsigned long __user *, user_mask_ptr)
+static bool
+valid_affinity_flags(int flags)
+{
+	return flags == SCHED_HARD_AFFINITY || flags == SCHED_SOFT_AFFINITY;
+}
+
+static int
+sched_setaffinity_common(pid_t pid, unsigned int len,
+			 unsigned long __user *user_mask_ptr, int flags)
 {
 	cpumask_var_t new_mask;
 	int retval;
 
+	if (!valid_affinity_flags(flags))
+		return -EINVAL;
+
 	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
 		return -ENOMEM;
 
 	retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
 	if (retval == 0)
-		retval = sched_setaffinity(pid, new_mask);
+		retval = sched_setaffinity(pid, new_mask, flags);
 	free_cpumask_var(new_mask);
 	return retval;
 }
 
+SYSCALL_DEFINE4(sched_setaffinity2, pid_t, pid, unsigned int, len,
+		unsigned long __user *, user_mask_ptr, int, flags)
+{
+	return sched_setaffinity_common(pid, len, user_mask_ptr, flags);
+}
+
+/**
+ * sys_sched_setaffinity - set the CPU affinity of a process
+ * @pid: pid of the process
+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
+ * @user_mask_ptr: user-space pointer to the new CPU mask
+ *
+ * Return: 0 on success. An error code otherwise.
+ */
+SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
+		unsigned long __user *, user_mask_ptr)
+{
+	return sched_setaffinity_common(pid, len, user_mask_ptr,
+					SCHED_HARD_AFFINITY);
+}
+
 long sched_getaffinity(pid_t pid, struct cpumask *mask)
 {
 	struct task_struct *p;
-- 
2.9.3



* [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
  2019-06-26 22:47 [RFC PATCH 0/3] Scheduler Soft Affinity subhra mazumdar
  2019-06-26 22:47 ` [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity subhra mazumdar
@ 2019-06-26 22:47 ` subhra mazumdar
  2019-07-02 17:28   ` Peter Zijlstra
  2019-06-26 22:47 ` [RFC PATCH 3/3] sched: introduce tunables to control soft affinity subhra mazumdar
  2 siblings, 1 reply; 12+ messages in thread
From: subhra mazumdar @ 2019-06-26 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

The soft affinity CPUs present in the cpumask cpus_preferred are used by
the scheduler in two levels of search. The first is in determining wake
affinity, which chooses the LLC domain, and the second is while searching
for idle CPUs in the LLC domain. In the first level it uses cpus_preferred
to prune the search space. In the second level it first searches
cpus_preferred and then cpus_allowed. Using the affinity_unequal flag it
breaks out early to avoid any overhead in the scheduler fast path when soft
affinity is not used. This only changes the wakeup path of the scheduler;
the idle balancing is unchanged. Together they achieve the "softness" of
scheduling.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 137 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..53aa7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int sd_flag)
+		  int this_cpu, int sd_flag, struct cpumask *cpus)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
@@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Skip over this group if it has no CPUs allowed */
 		if (!cpumask_intersects(sched_group_span(group),
-					&p->cpus_allowed))
+					cpus))
 			continue;
 
 		local_group = cpumask_test_cpu(this_cpu,
@@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
  */
 static int
-find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p,
+		      int this_cpu, struct cpumask *cpus)
 {
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
@@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 		return cpumask_first(sched_group_span(group));
 
 	/* Traverse only the allowed CPUs */
-	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
+	for_each_cpu_and(i, sched_group_span(group), cpus) {
 		if (available_idle_cpu(i)) {
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
@@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 }
 
 static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
-				  int cpu, int prev_cpu, int sd_flag)
+				  int cpu, int prev_cpu, int sd_flag,
+				  struct cpumask *cpus)
 {
 	int new_cpu = cpu;
 
@@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		group = find_idlest_group(sd, p, cpu, sd_flag, cpus);
+
 		if (!group) {
 			sd = sd->child;
 			continue;
 		}
 
-		new_cpu = find_idlest_group_cpu(group, p, cpu);
+		new_cpu = find_idlest_group_cpu(group, p, cpu, cpus);
 		if (new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of 'cpu': */
 			sd = sd->child;
@@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }
 
+static inline int
+scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target)
+{
+	int core, cpu;
+
+	for_each_cpu_wrap(core, cpus, target) {
+		bool idle = true;
+
+		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			cpumask_clear_cpu(cpu, cpus);
+			if (!idle_cpu(cpu))
+				idle = false;
+		}
+
+		if (idle)
+			return core;
+	}
+
+	return -1;
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	int core;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6120,21 +6144,22 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	if (!test_idle_cores(target, false))
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_preferred);
+	core = scan_cpu_mask_for_idle_cores(cpus, target);
 
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
+	if (core >= 0)
+		return core;
 
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			__cpumask_clear_cpu(cpu, cpus);
-			if (!available_idle_cpu(cpu))
-				idle = false;
-		}
+	if (!p->affinity_unequal)
+		goto out;
 
-		if (idle)
-			return core;
-	}
+	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+	cpumask_andnot(cpus, cpus, &p->cpus_preferred);
+	core = scan_cpu_mask_for_idle_cores(cpus, target);
 
+	if (core >= 0)
+		return core;
+out:
 	/*
 	 * Failed to find an idle core; stop looking for one.
 	 */
@@ -6143,24 +6168,40 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	return -1;
 }
 
+static inline int
+scan_cpu_mask_for_idle_smt(struct cpumask *cpus, int target)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		if (!cpumask_test_cpu(cpu, cpus))
+			continue;
+		if (idle_cpu(cpu))
+			return cpu;
+	}
+
+	return -1;
+}
+
 /*
  * Scan the local SMT mask for idle CPUs.
  */
 static int select_idle_smt(struct task_struct *p, int target)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
 
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
-		if (available_idle_cpu(cpu))
-			return cpu;
-	}
+	cpu = scan_cpu_mask_for_idle_smt(&p->cpus_preferred, target);
 
-	return -1;
+	if (cpu >= 0 || !p->affinity_unequal)
+		return cpu;
+
+	cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+
+	return scan_cpu_mask_for_idle_smt(cpus, target);
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6177,6 +6218,24 @@ static inline int select_idle_smt(struct task_struct *p, int target)
 
 #endif /* CONFIG_SCHED_SMT */
 
+static inline int
+scan_cpu_mask_for_idle_cpu(struct cpumask *cpus, int target,
+			   struct sched_domain *sd, int *nr)
+{
+	int cpu;
+
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+		if (!--(*nr))
+			return -1;
+		if (!cpumask_test_cpu(cpu, cpus))
+			continue;
+		if (available_idle_cpu(cpu))
+			break;
+	}
+
+	return cpu;
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6185,10 +6244,11 @@ static inline int select_idle_smt(struct task_struct *p, int target)
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, nr_begin;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6212,16 +6272,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = 4;
 	}
 
+	nr_begin = nr;
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (!--nr)
-			return -1;
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
-		if (available_idle_cpu(cpu))
-			break;
-	}
+	cpu = scan_cpu_mask_for_idle_cpu(&p->cpus_preferred, target, sd, &nr);
+
+	if (!nr || !p->affinity_unequal || cpu != target || nr >= nr_begin - 1)
+		goto out;
+
+	cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+
+	cpu = scan_cpu_mask_for_idle_cpu(cpus, target, sd, &nr);
+out:
 
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
@@ -6677,6 +6739,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	struct cpumask *cpus = &p->cpus_preferred;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
@@ -6689,7 +6752,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, &p->cpus_allowed);
+			      cpumask_test_cpu(cpu, cpus);
 	}
 
 	rcu_read_lock();
@@ -6718,7 +6781,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (unlikely(sd)) {
 		/* Slow path */
-		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
+		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag, cpus);
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
 
-- 
2.9.3



* [RFC PATCH 3/3] sched: introduce tunables to control soft affinity
  2019-06-26 22:47 [RFC PATCH 0/3] Scheduler Soft Affinity subhra mazumdar
  2019-06-26 22:47 ` [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity subhra mazumdar
  2019-06-26 22:47 ` [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs subhra mazumdar
@ 2019-06-26 22:47 ` subhra mazumdar
  2019-07-18 10:08   ` Srikar Dronamraju
  2 siblings, 1 reply; 12+ messages in thread
From: subhra mazumdar @ 2019-06-26 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

For different workloads the optimal "softness" of soft affinity can be
different. Introduce the tunables sched_allowed and sched_preferred, which
can be set via /proc. They allow choosing at what utilization difference
the scheduler will choose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, the cache
coherency overhead of the system, etc., the optimal point may vary.
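
For example, the 5:4 setting used in the cover letter can be applied from
userspace roughly as follows. This is only a sketch and assumes the two
entries show up under /proc/sys/kernel/, as the other kern_table sysctls
do:

#include <stdio.h>

/* Write a single integer value to a procfs sysctl file. */
static int write_sysctl(const char *path, unsigned int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%u\n", val);
	return fclose(f);
}

int main(void)
{
	/* sched_allowed:sched_preferred = 5:4, i.e. "softer" soft affinity */
	if (write_sysctl("/proc/sys/kernel/sched_allowed", 5) ||
	    write_sysctl("/proc/sys/kernel/sched_preferred", 4))
		return 1;
	return 0;
}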

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c          | 19 ++++++++++++++++++-
 kernel/sched/sched.h         |  2 ++
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..0e75602 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_preferred;
+extern __read_mostly unsigned int sysctl_sched_allowed;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53aa7f2..d222d78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
 static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
+const_debug unsigned int sysctl_sched_preferred		= 1UL;
+const_debug unsigned int sysctl_sched_allowed		= 100UL;
 
 #ifdef CONFIG_SMP
 /*
@@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
-	struct cpumask *cpus = &p->cpus_preferred;
+	int cpux, cpuy;
+	struct cpumask *cpus;
+
+	if (!p->affinity_unequal) {
+		cpus = &p->cpus_allowed;
+	} else {
+		cpux = cpumask_any(&p->cpus_preferred);
+		cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+		cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+		cpuy = cpumask_any(cpus);
+		if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg >
+		    sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg)
+			cpus = &p->cpus_allowed;
+		else
+			cpus = &p->cpus_preferred;
+	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..f856bdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
+extern const_debug unsigned int sysctl_sched_preferred;
+extern const_debug unsigned int sysctl_sched_allowed;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7d1008b..bdffb48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "sched_preferred",
+		.data           = &sysctl_sched_preferred,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "sched_allowed",
+		.data           = &sysctl_sched_allowed,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.9.3



* Re: [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity
  2019-06-26 22:47 ` [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity subhra mazumdar
@ 2019-07-02 16:23   ` Peter Zijlstra
  2019-07-02 16:29   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-07-02 16:23 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

On Wed, Jun 26, 2019 at 03:47:16PM -0700, subhra mazumdar wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1183741..b863fa8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -652,6 +652,8 @@ struct task_struct {
>  	unsigned int			policy;
>  	int				nr_cpus_allowed;
>  	cpumask_t			cpus_allowed;

You're patching dead code, that no longer exists.

> +	cpumask_t			cpus_preferred;
> +	bool				affinity_unequal;

Urgh, no. cpumask_t is an abomination and having one of them is already
unfortunate, having two is really not sane, esp. since for 99% of the
tasks they'll be exactly the same.

Why not add cpus_ptr_soft or something like that, and have it point at
cpus_mask by default, and when it needs to not be the same, allocate a
cpumask for it. That also gets rid of that unequal thing.
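
Something along these lines, perhaps (a rough, untested sketch; the helper
names are made up):

	/* in task_struct, next to cpus_mask: */
	const cpumask_t		*cpus_ptr_soft;	/* &cpus_mask unless soft set */

	/* which replaces the affinity_unequal flag: */
	static inline bool task_has_soft_affinity(struct task_struct *p)
	{
		return p->cpus_ptr_soft != &p->cpus_mask;
	}

	/* and clearing soft affinity frees the separately allocated mask: */
	static void task_clear_soft_affinity(struct task_struct *p)
	{
		if (task_has_soft_affinity(p))
			kfree(p->cpus_ptr_soft);
		p->cpus_ptr_soft = &p->cpus_mask;
	}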




* Re: [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity
  2019-06-26 22:47 ` [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity subhra mazumdar
  2019-07-02 16:23   ` Peter Zijlstra
@ 2019-07-02 16:29   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-07-02 16:29 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

On Wed, Jun 26, 2019 at 03:47:16PM -0700, subhra mazumdar wrote:
> @@ -1082,6 +1088,37 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>  		put_prev_task(rq, p);
>  
>  	p->sched_class->set_cpus_allowed(p, new_mask);
> +	set_cpus_preferred_common(p, new_mask);
> +
> +	if (queued)
> +		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> +	if (running)
> +		set_curr_task(rq, p);
> +}
> +
> +void do_set_cpus_preferred(struct task_struct *p,
> +			   const struct cpumask *new_mask)
> +{
> +	struct rq *rq = task_rq(p);
> +	bool queued, running;
> +
> +	lockdep_assert_held(&p->pi_lock);
> +
> +	queued = task_on_rq_queued(p);
> +	running = task_current(rq, p);
> +
> +	if (queued) {
> +		/*
> +		 * Because __kthread_bind() calls this on blocked tasks without
> +		 * holding rq->lock.
> +		 */
> +		lockdep_assert_held(&rq->lock);
> +		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
> +	}
> +	if (running)
> +		put_prev_task(rq, p);
> +
> +	set_cpus_preferred_common(p, new_mask);
>  
>  	if (queued)
>  		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> @@ -1170,6 +1207,41 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>  	return ret;
>  }
>  
> +static int
> +__set_cpus_preferred_ptr(struct task_struct *p, const struct cpumask *new_mask)
> +{
> +	const struct cpumask *cpu_valid_mask = cpu_active_mask;
> +	unsigned int dest_cpu;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	int ret = 0;
> +
> +	rq = task_rq_lock(p, &rf);
> +	update_rq_clock(rq);
> +
> +	if (p->flags & PF_KTHREAD) {
> +		/*
> +		 * Kernel threads are allowed on online && !active CPUs
> +		 */
> +		cpu_valid_mask = cpu_online_mask;
> +	}
> +
> +	if (cpumask_equal(&p->cpus_preferred, new_mask))
> +		goto out;
> +
> +	if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	do_set_cpus_preferred(p, new_mask);
> +
> +out:
> +	task_rq_unlock(rq, p, &rf);
> +
> +	return ret;
> +}
> +
>  int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
>  {
>  	return __set_cpus_allowed_ptr(p, new_mask, false);
> @@ -4724,7 +4796,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
>  	return retval;
>  }
>  
> -long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> +long sched_setaffinity(pid_t pid, const struct cpumask *in_mask, int flags)
>  {
>  	cpumask_var_t cpus_allowed, new_mask;
>  	struct task_struct *p;
> @@ -4742,6 +4814,11 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>  	get_task_struct(p);
>  	rcu_read_unlock();
>  
> +	if (flags == SCHED_SOFT_AFFINITY &&
> +	    p->sched_class != &fair_sched_class) {
> +		retval = -EINVAL;
> +		goto out_put_task;
> +	}
>  	if (p->flags & PF_NO_SETAFFINITY) {
>  		retval = -EINVAL;
>  		goto out_put_task;
> @@ -4790,18 +4867,37 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>  	}
>  #endif
>  again:
> -	retval = __set_cpus_allowed_ptr(p, new_mask, true);
> -
> -	if (!retval) {
> -		cpuset_cpus_allowed(p, cpus_allowed);
> -		if (!cpumask_subset(new_mask, cpus_allowed)) {
> -			/*
> -			 * We must have raced with a concurrent cpuset
> -			 * update. Just reset the cpus_allowed to the
> -			 * cpuset's cpus_allowed
> -			 */
> -			cpumask_copy(new_mask, cpus_allowed);
> -			goto again;
> +	if (flags == SCHED_HARD_AFFINITY) {
> +		retval = __set_cpus_allowed_ptr(p, new_mask, true);
> +
> +		if (!retval) {
> +			cpuset_cpus_allowed(p, cpus_allowed);
> +			if (!cpumask_subset(new_mask, cpus_allowed)) {
> +				/*
> +				 * We must have raced with a concurrent cpuset
> +				 * update. Just reset the cpus_allowed to the
> +				 * cpuset's cpus_allowed
> +				 */
> +				cpumask_copy(new_mask, cpus_allowed);
> +				goto again;
> +			}
> +			p->affinity_unequal = false;
> +		}
> +	} else if (flags == SCHED_SOFT_AFFINITY) {
> +		retval = __set_cpus_preferred_ptr(p, new_mask);
> +		if (!retval) {
> +			cpuset_cpus_allowed(p, cpus_allowed);
> +			if (!cpumask_subset(new_mask, cpus_allowed)) {
> +				/*
> +				 * We must have raced with a concurrent cpuset
> +				 * update.
> +				 */
> +				cpumask_and(new_mask, new_mask, cpus_allowed);
> +				goto again;
> +			}
> +			if (!cpumask_equal(&p->cpus_allowed,
> +					   &p->cpus_preferred))
> +				p->affinity_unequal = true;
>  		}
>  	}
>  out_free_new_mask:

This seems like a terrible lot of pointless duplication; don't you get a
much smaller diff by passing the hard/soft thing into
__set_cpus_allowed_ptr() and only branching where it matters?


* Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
  2019-06-26 22:47 ` [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs subhra mazumdar
@ 2019-07-02 17:28   ` Peter Zijlstra
  2019-07-17  3:01     ` Subhra Mazumdar
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-07-02 17:28 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner

On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
> scheduler in two levels of search. First is in determining wake affine
> which choses the LLC domain and secondly while searching for idle CPUs in
> LLC domain. In the first level it uses cpus_preferred to prune out the
> search space. In the second level it first searches the cpus_preferred and
> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
> any overhead in the scheduler fast path when soft affinity is not used.
> This only changes the wake up path of the scheduler, the idle balancing
> is unchanged; together they achieve the "softness" of scheduling.

I really dislike this implementation.

I thought the idea was to remain work conserving (in so far as that
we're that anyway), so changing select_idle_sibling() doesn't make sense
to me. If there is idle, we use it.

Same for newidle; which you already retained.

This then leaves regular balancing, and for that we can fudge with
can_migrate_task() and nr_balance_failed or something.

And I also really don't want a second utilization tipping point; we
already have the overloaded thing.

I also still dislike how you never looked into the numa balancer, which
already has preferred_nid stuff.


* Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
  2019-07-02 17:28   ` Peter Zijlstra
@ 2019-07-17  3:01     ` Subhra Mazumdar
  2019-07-18 11:37       ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Subhra Mazumdar @ 2019-07-17  3:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner


On 7/2/19 10:58 PM, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
>> scheduler in two levels of search. First is in determining wake affine
>> which choses the LLC domain and secondly while searching for idle CPUs in
>> LLC domain. In the first level it uses cpus_preferred to prune out the
>> search space. In the second level it first searches the cpus_preferred and
>> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
>> any overhead in the scheduler fast path when soft affinity is not used.
>> This only changes the wake up path of the scheduler, the idle balancing
>> is unchanged; together they achieve the "softness" of scheduling.
> I really dislike this implementation.
>
> I thought the idea was to remain work conserving (in so far as that
> we're that anyway), so changing select_idle_sibling() doesn't make sense
> to me. If there is idle, we use it.
>
> Same for newidle; which you already retained.
The scheduler is already not work conserving in many ways. Soft affinity is
only for those who want to use it and has no side effects when not used.
Also, the way the scheduler is implemented, it may not be possible to do
the first level of search in a work conserving way; I am open to ideas.
>
> This then leaves regular balancing, and for that we can fudge with
> can_migrate_task() and nr_balance_failed or something.
Possibly but I don't know if similar performance behavior can be achieved
by the periodic load balancer. Do you want a performance comparison of the
two approaches?
>
> And I also really don't want a second utilization tipping point; we
> already have the overloaded thing.
The numbers in the cover letter show that a static tipping point will not
work for all workloads. What soft affinity is doing is essentially trading
off cache coherence for more CPU. The optimum tradeoff point will vary
from workload to workload and with system characteristics such as the
coherence overhead. If we just use the domain overload, that becomes a
static definition of the tipping point; we need something tunable that
captures this tradeoff. The ratio of CPU utilization seemed to work well
and capture that.
>
> I also still dislike how you never looked into the numa balancer, which
> already has peferred_nid stuff.
Not sure if you mean using the existing NUMA balancer or enhancing it. If
the former, I have numbers in the cover letter that show the NUMA balancer
is not making any difference. I allocated the memory of each DB instance to
one NUMA node using numactl, but the NUMA balancer still migrated pages, so
numactl only seems to control the initial allocation. Secondly, even though
the NUMA balancer migrated pages it had no performance benefit compared to
disabling it.


* Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity
  2019-06-26 22:47 ` [RFC PATCH 3/3] sched: introduce tunables to control soft affinity subhra mazumdar
@ 2019-07-18 10:08   ` Srikar Dronamraju
  2019-07-19  7:23     ` Subhra Mazumdar
  0 siblings, 1 reply; 12+ messages in thread
From: Srikar Dronamraju @ 2019-07-18 10:08 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, prakash.sangappa,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman

* subhra mazumdar <subhra.mazumdar@oracle.com> [2019-06-26 15:47:18]:

> For different workloads the optimal "softness" of soft affinity can be
> different. Introduce tunables sched_allowed and sched_preferred that can
> be tuned via /proc. This allows to chose at what utilization difference
> the scheduler will chose cpus_allowed over cpus_preferred in the first
> level of search. Depending on the extent of data sharing, cache coherency
> overhead of the system etc. the optimal point may vary.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---

Correct me, but this patchset only seems to be concentrated on the wakeup
path; I don't see any changes in the regular load balancer or the
numa-balancer. If the system is loaded or tasks are CPU intensive, then
wouldn't these tasks be moved to cpus_allowed instead of cpus_preferred,
hence breaking this soft affinity?

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
  2019-07-17  3:01     ` Subhra Mazumdar
@ 2019-07-18 11:37       ` Peter Zijlstra
  2019-07-19  2:55         ` Subhra Mazumdar
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-07-18 11:37 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner

On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:
> 
> On 7/2/19 10:58 PM, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
> > > The soft affinity CPUs present in the cpumask cpus_preferred is used by the
> > > scheduler in two levels of search. First is in determining wake affine
> > > which choses the LLC domain and secondly while searching for idle CPUs in
> > > LLC domain. In the first level it uses cpus_preferred to prune out the
> > > search space. In the second level it first searches the cpus_preferred and
> > > then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
> > > any overhead in the scheduler fast path when soft affinity is not used.
> > > This only changes the wake up path of the scheduler, the idle balancing
> > > is unchanged; together they achieve the "softness" of scheduling.
> > I really dislike this implementation.
> > 
> > I thought the idea was to remain work conserving (in so far as that
> > we're that anyway), so changing select_idle_sibling() doesn't make sense
> > to me. If there is idle, we use it.
> > 
> > Same for newidle; which you already retained.
> The scheduler is already not work conserving in many ways. Soft affinity is
> only for those who want to use it and has no side effects when not used.
> Also the way scheduler is implemented in the first level of search it may
> not be possible to do it in a work conserving way, I am open to ideas.

I really don't understand the premise of this soft affinity stuff then.

I understood it was to allow spreading if under-utilized, but group when
over-utilized, but you're arguing for the exact opposite, which doesn't
make sense.

> > And I also really don't want a second utilization tipping point; we
> > already have the overloaded thing.
> The numbers in the cover letter show that a static tipping point will not
> work for all workloads. What soft affinity is doing is essentially trading
> off cache coherence for more CPU. The optimum tradeoff point will vary
> from workload to workload and the system metrics of coherence overhead etc.
> If we just use the domain overload that becomes a static definition of
> tipping point, we need something tunable that captures this tradeoff. The
> ratio of CPU util seemed to work well and capture that.

And then you run two workloads with different characteristics on the
same box.

Global knobs are buggered.


* Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
  2019-07-18 11:37       ` Peter Zijlstra
@ 2019-07-19  2:55         ` Subhra Mazumdar
  0 siblings, 0 replies; 12+ messages in thread
From: Subhra Mazumdar @ 2019-07-19  2:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, prakash.sangappa, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner


On 7/18/19 5:07 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:
>> On 7/2/19 10:58 PM, Peter Zijlstra wrote:
>>> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>>>> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
>>>> scheduler in two levels of search. First is in determining wake affine
>>>> which choses the LLC domain and secondly while searching for idle CPUs in
>>>> LLC domain. In the first level it uses cpus_preferred to prune out the
>>>> search space. In the second level it first searches the cpus_preferred and
>>>> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
>>>> any overhead in the scheduler fast path when soft affinity is not used.
>>>> This only changes the wake up path of the scheduler, the idle balancing
>>>> is unchanged; together they achieve the "softness" of scheduling.
>>> I really dislike this implementation.
>>>
>>> I thought the idea was to remain work conserving (in so far as that
>>> we're that anyway), so changing select_idle_sibling() doesn't make sense
>>> to me. If there is idle, we use it.
>>>
>>> Same for newidle; which you already retained.
>> The scheduler is already not work conserving in many ways. Soft affinity is
>> only for those who want to use it and has no side effects when not used.
>> Also the way scheduler is implemented in the first level of search it may
>> not be possible to do it in a work conserving way, I am open to ideas.
> I really don't understand the premise of this soft affinity stuff then.
>
> I understood it was to allow spreading if under-utilized, but group when
> over-utilized, but you're arguing for the exact opposite, which doesn't
> make sense.
You are right on the premise. The whole knob thing came into existence
because I couldn't make the first level of search work conserving. I am
concerned that trying to make that work conserving can introduce
significant latency in the code path when SA is used. I have made the
second level of search work conserving when we search the LLC domain.

Having said that, SA need not necessarily be binary, i.e. only spilling
over to the allowed set when the preferred set is 100% utilized (work
conserving). The spill-over can happen before that, and SA can have a
degree of softness.

The above two points made me go down the knob path for the first level of
search.


* Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity
  2019-07-18 10:08   ` Srikar Dronamraju
@ 2019-07-19  7:23     ` Subhra Mazumdar
  0 siblings, 0 replies; 12+ messages in thread
From: Subhra Mazumdar @ 2019-07-19  7:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, peterz, mingo, tglx, prakash.sangappa,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman


On 7/18/19 3:38 PM, Srikar Dronamraju wrote:
> * subhra mazumdar <subhra.mazumdar@oracle.com> [2019-06-26 15:47:18]:
>
>> For different workloads the optimal "softness" of soft affinity can be
>> different. Introduce tunables sched_allowed and sched_preferred that can
>> be tuned via /proc. This allows to chose at what utilization difference
>> the scheduler will chose cpus_allowed over cpus_preferred in the first
>> level of search. Depending on the extent of data sharing, cache coherency
>> overhead of the system etc. the optimal point may vary.
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
> Correct me but this patchset only seems to be concentrated on the wakeup
> path, I don't see any changes in the regular load balancer or the
> numa-balancer. If system is loaded or tasks are CPU intensive, then wouldn't
> these tasks be moved to cpus_allowed instead of cpus_preferred and hence
> breaking this soft affinity.
>
The new-idle path is purposefully unchanged; if threads get stolen to the
allowed set from the preferred set, that's intended. Together with the
enqueue side it achieves the softness of affinity.

